Stand over there, would you, while I throw this wellington boot. I want you to see how well I throw it. Pay attention: you need to judge me on my welly-throwing. Oops, that throw wasn’t very good! Let’s not count that. Ah, the second throw was better. OK, now my assistant will measure how far it went. No – him, not you. It’s actually quite hard to measure it properly – the tape has to be taut, so I have to secure it in the ground here – and I’ve not learnt to do that properly. Anyway, a bit of slack is all to the good! We’ll use this tape-measure which we made: it uses a special unit of distance which we invented.
This, I suspect, is uncomfortably close to how charities’ monitoring and evaluation work. Charities get judged on ‘evaluations’ which they themselves produce, for which they design measures, and they decide whether and what to publish. It appears not to help them much. If the aim is to improve decisions – by operating charities, by funders, by policy-makers – by enabling access to reliable evidence about what’s worth doing and what’s worth prioritising, then much of it fails: it’s just too ropey.
This article first appeared in Third Sector. A pdf of it is here.
This needs to stop. It wastes time and money, and – possibly worse – pulls people towards bad decisions. My aim here isn’t to just bitch, but rather to honestly present some evidence about how monitoring and evaluation actually works currently, and make some suggestions about creating a better set-up.
Why are we evaluating?
When asked in 2012 what prompted their impact measurement efforts, 52% of UK charities and social enterprises talked about funders’ requirements. Despite being social-purpose organisations, the proportion which cited ‘wanting to improve our services’ was a paltry 7%[i].
A study by two American universities indicates the incentives which influence charities’ evaluations. In a randomised controlled trial, the universities contacted 1,419 micro-finance institutions (MFIs) offering to rigorously evaluate their work. (It was a genuine offer.) Half of the invitations referenced a (real) study by prominent researchers indicating that microcredit is effective. The other half of the invitations referenced another real study, by the same researchers using a similar design, which indicated that microcredit has no effect.
The letters suggesting that microfinance works got twice as many positive responses as those which suggested that it doesn’t work.[ii] Of course. The MFIs are selling. They’re doing evaluations in order to bolster their case. To donors.
Hence it’s little surprise if evaluations which don’t flatter aren’t published. I myself withheld unflattering research when I ran a charity (discussed here). Withholding and publication bias are probably widespread in the voluntary sector – Giving Evidence is starting what we believe to be the first ever study of them – preventing evidence-based decisions, and wasting money.
If charities want (or are forced, by the incentives set up for them) to produce evaluations which flatter them, they’re likely to choose bad research methods. Consider a survey. If you survey 50 random people, you’ll probably hear representative views. But if you choose which 50 to ask, you could choose only the cheery people. Furthermore, bad research is cheaper: surveying five people costs less than surveying a sample of 200 large enough to be statistically meaningful. A charity in criminal justice told me recently of a grant from a UK foundation “of which half was for evaluation. That was £5,000. I said to them that that’s ridiculous, and kind of unfair. We obviously can’t do decent research with that.”
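Both problems – tiny samples and cherry-picked respondents – can be illustrated with a small simulation. This is only a sketch: the population, the 60% satisfaction rate and the sample sizes are invented for illustration.

```python
import random

random.seed(42)

# Hypothetical population of 1,000 service users, 60% of whom are satisfied.
population = [1] * 600 + [0] * 400

def survey(pop, n):
    """Proportion reporting satisfaction in a random sample of n people."""
    return sum(random.sample(pop, n)) / n

# Repeat each survey many times to see how much the answer wobbles.
small = [survey(population, 5) for _ in range(2000)]
large = [survey(population, 200) for _ in range(2000)]

print(f"n=5:   answers range from {min(small):.2f} to {max(small):.2f}")
print(f"n=200: answers range from {min(large):.2f} to {max(large):.2f}")

# Cherry-picking: survey only people already known to be satisfied.
cheery = [p for p in population if p == 1]
print(f"cherry-picked n=50: {survey(cheery, 50):.2f}")  # 1.00 by construction
```

The five-person survey can return almost anything, the 200-person survey stays close to the true 60%, and the cherry-picked survey reports universal satisfaction regardless of reality.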
Charities’ research is often poor quality. The Paul Hamlyn Foundation assessed the quality of research reports it received from its grantees over several years and graded them: good, OK, not OK. The scale it used was much more generous than how medics grade research. Even so, 70% was not good. Another example is the Arts Alliance’s library of evidence produced by charities using the arts in criminal justice. About two years ago, it held 86 studies. When the government’s offender management service looked at that evidence for a review with a minimum quality standard, how many of those studies could it use? Four. The new ‘what works centre’ for crime reduction found much the same. It searched for all systematic reviews about crime reduction [systematic reviews compile all findable evidence above a stated quality threshold] and found 337. When Giving Evidence asked about the contribution of research by charities to those reviews, the answer was ‘very small’. One charity CEO we interviewed recently blurted it right out:
“When I first started in this [sector], I kept talking about evaluation and he [a senior person in the charity sector] said to me: ‘Don’t worry about that. You can just make it up. Everybody else does. At the very least you should exaggerate a lot. You’ll have to, to get funded.’”
“Ask an important question and answer it reliably”
This is a central tenet of clinical research. Though it sounds obvious, it isn’t what happens in our sector. On reliability, much research by charities fails, as discussed. That’s almost inevitable: investigating causal links is hard, and most charities don’t have those skills. Given the sector’s fragmentation (the UK has 1,475 NGOs in criminal justice alone), you wouldn’t want them all to hire a researcher.
And on importance, charities’ research often seems to fall short there too. 65% of CEOs of US foundations say that generating meaningful insights from evaluations is ‘a challenge’[iii].
The collective spend on evaluation in the US is 2% of total grant-making[iv]. That proportion of UK grant-making would be £92m. That’s easily enough for many pieces of reliable research, but split between loads of organisations and into pieces of £5,000, it can only generate garbage. It’s as though we’re mountaineering and everybody gets into the foothills but nobody reaches the summit. Everybody tickles the question but no-one nails it.
It’s wasteful and it should stop.
We need one other thing too. Almost all decisions are between options: this intervention versus that one, for example. To enable evidence-based decisions, evaluations must enable comparisons. So it’s no good if everybody designs their own tape-measure: a survey of 120 UK charities and social enterprises found over 130 measurement tools in play[v]. We need standardised metrics. These needn’t be some impossible universal measure of human happiness; they could be standardised within specialisms, such as some types of mental health care, job creation or back-to-work programmes.
Cite evidence, don’t produce it
When I get in an aeroplane, I do not wish my flight to be in a rigorous trial to conclusively prove whether the plane will stay up or not: I want to know that that’s been established already. If an intervention is innovative – say I’m having a new medical drug – then obviously it won’t yet have been fully evaluated, but it’s reasonable to ask that the practitioner can cite some evidence that this intervention isn’t bonkers: maybe it’s a variation on a known drug, or other research suggests a plausible causal mechanism.
We should do more of this in our sector. We should expect organisations to cite research which supports their theory of change; but we don’t need every single organisation to produce research.
Imagine that you’re considering starting a breakfast club in a school. Should you do an evaluation? The table below explains.
Answer: no! The first thing you do is look at the literature to see what’s already known about whether they work. To be fair, ‘the literature’ is currently disorganised, unclear and tough to navigate (hence Giving Evidence is working on that – more detail soon), but ideally you’d look at research by other charities and academics and others.
If that research is reliable and shows that the clubs don’t work, then obviously you stop.
If that research is reliable and shows that clubs do work, then just crack on. The evaluation has already been done and you don’t need to duplicate it: by analogy, we don’t expect every hospital to be a test site. You can just cite that evidence, and monitor your results to check that they’re in-line with what the trials predict. (If not, that suggests a problem in implementation, which a process evaluation can explore.)
This of course is different to what happens now: in the model I’m suggesting, in the circumstances described, you will never have a rigorous evaluation of your breakfast club. Just as most cancer patients will never be in a rigorous trial, and you never want your flight to be one. But you will (a) have a sound basis for believing that your club improves learning outcomes (in fact, a much better basis than if you’d attempted an evaluation, like our friend earlier, with just £5,000) and (b) won’t have spent any time or money on evaluation. Of course, this model requires funders, commissioners, trustees and others to sign up to the ‘cite, don’t necessarily produce’ model of research, which I realise isn’t trivial. They too would look for evidence before they fund, rather than looking just at the ‘monitoring and evaluation’ which emerge afterwards. The Children’s Investment Fund Foundation, for example, reviews the literature relevant to any application it’s considering.
Under this model, many fewer evaluations happen. Those few can be better.
If your literature search finds no evidence because it’s a novel idea, then look at relevant literature (see column on page x), run a pilot, as described, and if it works, eventually decide whether to do a ‘proper’ evaluation.
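The decision flow of the last few paragraphs can be sketched as a small function. The evidence categories and the wording of each step are my illustrative assumptions, not the author’s formal scheme.

```python
# Sketch of the "cite research, don't necessarily produce it" decision flow.
# Category names and return strings are illustrative, not a standard.

def next_step(evidence, pilot_succeeded=None):
    """evidence: 'reliable_negative', 'reliable_positive' or 'none'.
    pilot_succeeded: outcome of a pilot, if one has been run (novel ideas only).
    """
    if evidence == "reliable_negative":
        return "stop: the intervention is known not to work"
    if evidence == "reliable_positive":
        return "deliver: cite the existing evidence and monitor your results"
    # No directly relevant evidence: a genuinely novel idea.
    if pilot_succeeded is None:
        return "review adjacent literature and run a pilot (monitoring only)"
    if pilot_succeeded:
        return "consider commissioning an independent, rigorous evaluation"
    return "stop or redesign: the pilot suggests the programme isn't feasible"

print(next_step("reliable_positive"))
print(next_step("none"))
print(next_step("none", pilot_succeeded=True))
```

The point of the sketch is that a rigorous evaluation is the last resort, reached only when the literature is genuinely silent and a pilot has shown the idea is workable.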
Then arise the questions of who does that evaluation and who pays for it. I don’t have all the answers (and am interested in your ideas), but a few ‘points of light’ are clear.
First, the evaluation shouldn’t be funded by the charity. It’s a public good: other people will use it too, so it’s unfair to ‘tax’ the first-mover by making them fund it. In international development, many institutions want to use reliable evaluations but few are willing to pay, so many of them are funded centrally as a public good, through the International Initiative for Impact Evaluation (3ie), essentially a pooled fund from the Gates Foundation, the Hewlett Foundation and you, the UK tax-payer, through DFID. [As an aside, almost every sophisticated thing in international development has DFID involvement somewhere.]
Second, the budget for the evaluation has nothing to do with the size of the grant. If the question is important, it should be answered reliably even if that’s expensive. If adequate budget isn’t available, don’t evaluate at all: a bad answer can be worse than no answer and is just wasteful. Let’s not tickle questions.
Third, the evaluation shouldn’t be conducted by the charity – for the reasons of skill and incentives we’ve seen. The obvious answer is academics, but sadly their incentives aren’t always aligned with ours: their funding and status rest on high-profile journal articles, so (a) they might not be interested in the question and (b) their ‘product’ can be impenetrably theoretical (and may be paywalled). Several people in the last month have suggested that young researchers – PhD students and post-docs – with suitable skills may be the answer: some system to broker them in to charities whose work (genuinely) needs evaluating, rather as Project Oracle is doing with some charities in London.
Does evaluation preclude innovation?
No. You can tell that it doesn’t because the model in which charities cite research, but don’t always produce research, is essentially what happens in medicine, where there’s masses of innovation. In fact, reliable evaluation is essential to innovation, because reliable evaluations show which innovations are worth keeping. They also show what’s likely to work. Few things are totally new: most build on something already known. Suppose that you have a new programme for countering workplace gender discrimination. It relies on magic fairies visiting people at night. Well, that’s interesting, because there’s no evidence of magic fairies in the whole history of time. Thus there’s no evidence to support the notion that this programme will work.
By contrast, suppose that your programme assumes that people will follow the crowd, shy away from complicated decisions and are weirdly interested in hanging on to things they already own. Those three traits of human behaviour are very well-established: Daniel Kahneman was awarded a Nobel prize partly for demonstrating the latter, and substantial evidence for them all is in his book, Thinking, Fast and Slow.
At the outset, you won’t have any evaluations of your particular programme, but you can cite evidence that it’s not bonkers. We’re not talking here about proof, clearly, but rather about empirically-driven reasons to believe. What gives you reason to think that it’ll work? What else is similar and works elsewhere? What assumptions does the programme make about human behaviour, organisations or political systems, and what evidence supports those assumptions?
Hence the “cite research, don’t necessarily produce research” model reduces the risk of funding or implementing innovations which fail, and thereby wasting time, money and opportunity. It allows us to stand on the findings of many generations and disciplines and, hence, see better whether our innovation might work. We might call this “evidence-based innovation”.
On our guard
If there is no evidence, that doesn’t prove that the programme won’t work – but it should put us on our guard. The Dutch have a great phrase: “comply or explain”. If your innovative idea doesn’t comply with the existing evidence, then you have more explaining to do than if it does.
For example, to improve exam results, various economists handed schoolchildren a $20 note at the exam hall door. It sounds crazy. The students were told to hand the money back if they didn’t do well. Now, suddenly, it sounds sensible. This innovation is informed by Kahneman’s finding that people will work hard to retain something they already own – harder than they would work to gain that thing in the first place.
Context is, of course, important. Perhaps the evidence came from a time or place that is materially different and hence doesn’t apply – or, at least, requires a bit of translation to here and now. Hence, innovations might be evidence-informed, rather than proven.
And once your new gender programme is running, we need to see whether it really works – not just whether it looks as if it’s working. For that, we need rigorous evaluations.
What’s evaluation, what isn’t and what to do when?
“Evaluation is distinguished from monitoring by a serious attempt to establish causation”, says Michael Kell, chief economist at the National Audit Office.
Such research is not needed all the time. For service delivery, the types of research which are useful at the various stages of a programme’s development are as follows, taking the example of a school breakfast club:
| Stage of programme development | Purpose of the stage, and useful information to gather | Application to breakfast club |
| --- | --- | --- |
| Pilot | Establish whether the programme is feasible: is there demand, and what are the resource requirements (time, people, cost) and management challenges? Type of research: monitoring. | How much cereal is needed? Do children and parents want it? How many staff, and how much time, are needed to wash up? How much does it all cost? |
| Test | Now that the programme is stable and manageable, investigate whether the inputs cause the intended outcomes. Type of research: evaluation, ideally rigorous (e.g. with an equivalent control group) and conducted and funded independently. Most programmes need several evaluations, in diverse circumstances. | (How) does a breakfast club improve learning outcomes? |
| Scale-up / delivering services | The programme is now known to be effective and can be scaled up. We don’t need to evaluate it again, so can just monitor it to ensure that it’s working as expected. Type of research: monitoring. | Are the changes in learning outcomes in line with results from the trials? If not, something may be awry in implementation. Monitor beneficiary views, uptake, measurable results (e.g. test scores), and cost. |
Monitoring and evaluation of research and development work, and of advocacy, both work rather differently.
This table does not look at process evaluation, which is separate (and highly useful). That aims to understand whether the intervention was actually delivered as planned; variations in cost, quality, staffing etc.; and to identify operational improvements.
 This remains a terrible problem in medical research. For example, a study of 2000 studies of schizophrenia found 640 different measurement instruments, of which 369 were used only once.
[i] Making an Impact: Impact Measurement Across Charities and Social Enterprises in the UK, NPC, October 2012
[ii] Findlay, M. Aversion to Learning in Development? A Global Field Experiment on Microfinance Institutions. [Online] http://www.michael-findley.com/uploads/2/0/4/5/20455799/mfi_learning.22mar13.pdf [Accessed on: 24.09.14]