What every teacher needs to know about… the effect size


There’s a slew of research out there telling teachers what to teach and how to teach it. But when you’ve got a Year 8 lesson to plan and a pile of marking hanging over you, who’s got time to make sense of it all?

Luckily the Education Endowment Foundation, the Sutton Trust, John Hattie’s Visible Learning and a range of other jostling enterprises have done the hard work for us. They’ve filleted all those turgid research papers into handy, bite-size gobbets and labelled them all with an Effect Size which tells us how much progress our students will make if we adopt them.

The Effect Size (ES) allows us to quantify the magnitude of the difference between two groups. Armed with this measurement we have decided we can move beyond simply stating that an intervention works, to the more sophisticated consideration of how well it works compared to other interventions. But does the Effect Size give us an accurate and valid measure of difference?

Digging deeper

In order to answer this question we need to know what an ES actually corresponds to. An ES of 0 means that the average treatment participant outperformed 50% of the control participants. An effect size of 0.7 means that the average treatment-group participant will have outperformed the average control group member by 0.7 standard deviations (this means that an average person in the experimental group would be at the 76th percentile of the control group, or, to put it another way, if you divided 100 students into 4 ‘ability sets’, the average student in the experimental group would just about get into set 1). The baseline is that a year’s teaching should translate into a year’s progress, and any intervention that produces an ES of 0.4 or more is worthy of consideration. Hattie went about aggregating the effects of thousands of research studies to tell us how great an impact we could attribute to the various interventions and factors at play in classrooms.
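To pin down where such a number comes from, here is a minimal sketch of the usual calculation (the difference between the two group means divided by the pooled standard deviation), using invented test scores rather than figures from any real study:

```python
from statistics import mean, stdev

def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Effect size: difference in group means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_sd = (((n_t - 1) * stdev(treatment) ** 2 +
                  (n_c - 1) * stdev(control) ** 2) / (n_t + n_c - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Invented end-of-year test scores, for illustration only.
treatment_scores = [68, 74, 71, 77, 69, 73]
control_scores = [66, 72, 68, 74, 67, 71]
print(f"d = {cohens_d(treatment_scores, control_scores):.2f}")  # d = 0.72 for these numbers
```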

This is what Hattie found:

Influence                                Effect Size   Source of Influence
Feedback                                 1.13          Teacher
Student’s prior cognitive ability        1.04          Student
Instructional quality                    1.00          Teacher
Direct instruction                       0.82          Teacher
Acceleration                             0.72          Student
Remediation/feedback                     0.65          Teacher
Student’s disposition to learn           0.61          Student
Class environment                        0.56          Teacher
Challenge of goals                       0.52          Teacher
Peer tutoring                            0.50          Teacher
Mastery learning                         0.50          Teacher
Homework                                 0.43          Teacher
Teacher style                            0.42          Teacher
Questioning                              0.41          Teacher
Peer effects                             0.38          Peers
Advance organisers                       0.37          Teacher
Simulation & games                       0.34          Teacher
Computer-assisted instruction            0.31          Teacher
Testing                                  0.30          Teacher
Instructional media                      0.30          Teacher
Affective attributes of students         0.24          Student
Physical attributes of students          0.21          Student
Programmed instruction                   0.18          Teacher
Audio-visual aids                        0.16          Teacher
Individualisation                        0.14          Teacher
Finances/money                           0.12          School
Behavioural objectives                   0.12          Teacher
Team teaching                            0.06          Teacher
Physical attributes (e.g., class size)  -0.05          School

Table 1: Effect Sizes of Educational Interventions

So now you know: giving feedback is ace, questioning is barely worth it, and adjusting class size is pointless. You might well have a problem with some of these findings but let’s accept them for the time being.

Hattie says:

An effect-size of d=1.0 indicates an increase of one standard deviation… A one standard deviation increase is typically associated with advancing children’s achievement by two to three years, improving the rate of learning by 50%, or a correlation between some variable (e.g., amount of homework) and achievement of approximately r=0.50. When implementing a new program, an effect-size of 1.0 would mean that, on average, students receiving that treatment would exceed 84% of students not receiving that treatment (John Hattie, Visible Learning)
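The arithmetic here is at least internally consistent: the 84% (and the 76th percentile mentioned earlier) drops straight out of the normal curve. A minimal sketch, assuming normally distributed scores with the same spread in both groups:

```python
from statistics import NormalDist

def percentile_of_average_treated(d: float) -> float:
    """Percentile of the control group reached by the average treated pupil."""
    return NormalDist().cdf(d) * 100

for d in (0.0, 0.4, 0.7, 1.0):
    print(f"d = {d:.1f} -> {percentile_of_average_treated(d):.0f}th percentile")
# d = 0.0 -> 50th, d = 0.4 -> 66th, d = 0.7 -> 76th, d = 1.0 -> 84th
```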

Really? So if ‘feedback’ is given an effect size of 1.13, are we really supposed to believe that pupils given feedback would learn over 50 percent more than those who are not? Is that controlled against groups of pupils who were given no feedback at all? Seems unlikely, doesn’t it? And what does the finding that direct instruction has an ES of 0.82 mean? I doubt forcing passionate advocates of discovery learning to use direct instruction would have any such effect.

Mix and match

At this point it might be worth unpicking what we mean by meta-analysis. The term refers to statistical methods for contrasting and combining results from different studies, in the hope of identifying patterns, sources of disagreement, or other interesting relationships that may come to light from poring over the entrails of quantitative research.

The way meta-analyses are conducted in education has been nicked from clinicians. But in medicine it’s a lot easier to agree on what’s being measured: e.g. are you still alive a year after being discharged from hospital? Lumping the results from different education studies together tricks us into assuming different outcome measures are equally sensitive to what teachers do. Or to put it another way, that there is a standard unit of education. Now, if we don’t even agree what education is for, being unable to measure the success of different interventions in a meaningful way is a bit of a stumbling block.
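To see mechanically what the lumping together involves, here is a minimal sketch of the standard inverse-variance (fixed-effect) pooling step, with invented study figures; notice that nothing in the calculation asks whether the studies measured comparable outcomes.

```python
# Fixed-effect meta-analysis in miniature: pool study effect sizes by weighting
# each study by the inverse of its variance. The numbers below are invented.
studies = [
    # (effect size d, variance of d)
    (0.9, 0.04),  # small study, outcome measure closely tied to what was taught
    (0.3, 0.01),  # larger study, broader standardised test
    (0.5, 0.02),
]

weights = [1 / var for _, var in studies]
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
print(f"pooled effect size: {pooled:.2f}")  # ≈ 0.44, whatever the tests actually measured
```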

And then to make matters worse, there are at least two serious problems with effect sizes: the range of achievement and sensitivity to instruction.

Firstly, the range of achievement of the pupils studied influences effect sizes. Older children will show less improvement than younger children because they’ve already done a lot of learning and improvements are now much more incremental. If studies comparing the effects of interventions with six-year-olds and sixteen-year-olds claim to measure a common impact, their findings will be garbage.
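A rough, invented illustration of how the range of achievement feeds into the number: the same raw gain translates into a much smaller effect size when the spread of achievement in the cohort is wider, and achievement tends to be far more spread out among sixteen-year-olds than among six-year-olds.

```python
# Same raw gain in marks, different spread of achievement (all numbers invented).
raw_gain = 10

sd_six_year_olds = 15       # achievement relatively tightly bunched
sd_sixteen_year_olds = 30   # achievement far more spread out by age sixteen

print(raw_gain / sd_six_year_olds)      # d ≈ 0.67
print(raw_gain / sd_sixteen_year_olds)  # d ≈ 0.33
```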

The second problem is: how do we know there’s any impact at all? To see any kind of effect we usually rely on measuring pupils’ performance in some kind of test. But assessments vary greatly in the extent to which they measure the things that educational processes change. A test can be made more reliable by getting rid of questions which don’t differentiate between pupils – if all pupils tend to get particular questions right or wrong then those questions are of limited use. But this process changes the nature of tests: it may be that questions which teachers are good at teaching are replaced with those they’re not so good at teaching. This might be fair enough – except how then can we possibly hope to measure the extent to which pupils’ performance is influenced by particular teacher interventions?

The effects of sensitivity to instruction are a big deal. For instance, Bloom claimed that one-to-one tutorial instruction is more effective than average group-based instruction by two standard deviations. This is hardly credible. In standardised tests, one year’s progress for an average student is equivalent to one-fourth of a standard deviation, so one year’s individual tuition would have to equal nine years of average group-based instruction! Hmm? The point is that the time lag between teaching and testing appears to be the biggest factor in determining sensitivity to instruction, and the outcome measures used in different studies are unlikely to be equally sensitive to instruction.
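For what it’s worth, the ‘nine years’ arithmetic runs like this, using the figures in the paragraph above:

```python
# One year of average group-based instruction ≈ 0.25 standard deviations of progress.
yearly_progress_sd = 0.25

# Bloom's claim: one-to-one tuition adds two standard deviations on top of a normal year.
tutored_year_sd = yearly_progress_sd + 2.0

print(tutored_year_sd / yearly_progress_sd)  # 9.0, i.e. nine years' worth of group instruction
```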

Figure it out

In meta-analyses there’s little attempt to control for these problems, although, as long as studies make the duration of the trial clear, careful researchers can, and do, include the duration of the intervention as a moderating variable. This doesn’t mean we shouldn’t trust that the things Hattie puts at the top of his list have greater impact than those at the bottom, but it does mean we should think twice before bandying about effect sizes as evidence of potential impact.

Here’s the kicker: when numerical values are assigned to teaching we’re very easily taken in. The effects of teaching and learning are far too complex to be easily understood, but numbers are dead easy to understand: this one’s bigger than that one. This leads to pseudo-accuracy and makes us believe there are easy solutions to difficult problems. Few teachers (and I certainly include myself here) are statistically literate enough to properly interrogate this approach. The table of effect sizes, with its beguilingly accurate-seeming numbers, has been a comfort; someone has relieved us of having to think. But can we rely on these figures? Do they really tell us anything useful about how we should adjust our classroom practice?

On their own, no. But if we’re sufficiently cautious and see effect sizes as a very imprecise way to make a rough comparison; if we triangulate with the findings of cognitive science and the evidence of experienced teachers then we might allow that an intervention is probably worth trying. After all, evidence is meant to be helpful. As ever, a mix of healthy scepticism and a willingness to think really helps when looking at research evidence; assigning numerical values and establishing hierarchies of effective interventions is only misleading.

This piece is based on an extract from David’s forthcoming book, What if Everything you Knew about Education was Wrong?