In addition to concerns about the internal validity of impact evaluation estimates, concerns may arise about external validity, irrespective of the evaluation methodology adopted. External validity concerns the extent to which results derived from a specific evaluation study can be generalized to other contexts, and whether lessons can be taken away for the future. In particular, can one expect the same outcomes once the programme is scaled up, and can policy makers base decisions about introducing new policies and programmes on the experience of previous interventions in other contexts?
There are a number of reasons why the answer to such questions may be no. The first relates to the fact that an evaluation study will typically estimate only partial equilibrium effects, and these may differ from general equilibrium effects (Heckman, Lochner and Taber, 1998). In other words, the scale of the programme may affect estimated treatment effects. If an
intervention study is limited to a specific region or area, or if participation is means-tested in
some way, then taking that same programme and replicating it at the national level may lead to
very different results. This concern will be even more justified if the success of the intervention
is tied to the existence of specific institutions. For example, if a specific intervention rests on
the activities of a local NGO, then the impact when the programme is scaled up to the national
level may be quite different (Duflo and Kremer, 2005). Moreover, scaling a programme up
to the national level may alter the way that markets work, thereby affecting the operation of
the programme itself. For example, a wage subsidy programme tested at a local level may
show promising results, but when this same intervention is scaled nationally, it may alter the
operation of labour markets, and produce a different outcome (Ravallion, 2008).
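To fix ideas, the wage subsidy example can be put in simple potential-outcomes notation (the notation here is purely illustrative and is not drawn from the studies cited). Let $y_i(d, w)$ denote the outcome of individual $i$ under treatment status $d \in \{0, 1\}$ when the market wage is $w$. A small pilot leaves the wage at its pre-programme level $w_0$ and so identifies the partial equilibrium effect, whereas national scale-up shifts the equilibrium wage to $w_1$:
$$
\Delta^{PE} = E\left[\,y_i(1, w_0) - y_i(0, w_0)\,\right], \qquad
\Delta^{GE} = E\left[\,y_i(1, w_1) - y_i(0, w_0)\,\right].
$$
The pilot recovers $\Delta^{PE}$, but the policy-relevant quantity after scale-up is $\Delta^{GE}$; the two coincide only when the programme is too small to move the wage.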
Scaling up may also fail if the socio-economic composition of local participants differs from
the national demographic profile. Randomised interventions tested at a local level tend to
under-estimate how pro-poor a programme will be, since initial benefits of an intervention tend
to be captured by local elites (Lanjouw and Ravallion, 1999). However, as the programme is
scaled up, the incidence of benefits tends to become more pro-poor as they are extended
to greater numbers of individuals (Ravallion, 2004a).
An obvious difficulty in judging how generalisable the results from a specific intervention are is that the counterfactual is typically posed in terms of how participants would have fared in the absence of the intervention. However, policy makers are typically choosing amongst alternative programmes, not deciding whether or not to intervene at all. Hence, while a
specific intervention may fare well against a counterfactual of no intervention, it need not be
the case that the same intervention would fare as well when compared against a different policy
option.
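The same point can be expressed in the illustrative notation used above. Writing $Y_i(A)$, $Y_i(B)$ and $Y_i(0)$ for individual $i$'s outcome under programme $A$, an alternative programme $B$, and no intervention respectively, an evaluation of $A$ against a no-intervention counterfactual estimates
$$
\Delta_A = E\left[\,Y_i(A) - Y_i(0)\,\right],
$$
whereas the policy maker's choice turns on $\Delta_{A,B} = E\left[\,Y_i(A) - Y_i(B)\,\right]$. Finding $\Delta_A > 0$ says nothing about the sign of $\Delta_{A,B}$.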
Concerns over external validity may be ameliorated to the extent that interventions are
replicated in different settings and at different scales (Duflo and Kremer, 2005; Duflo, 2003).
The results from these replication studies provide evidence on the extent to which results can
be generalized. Since different contexts will require adaptations and changes to programmes,
the robustness of the programme or intervention is revealed by the extent to which it survives
these changes. Moreover, having multiple estimates of programme effects in different settings
these changes. Moreover having multiple estimates of programme estimates in different settings
gives some sense of how generalisable the results really are. For example, the findings from the
mass deworming intervention in Kenya reported by Miguel and Kremer (2004) were largely
vindicated in a study in India, reported by Bobonis, Miguel and Sharma (2002), despite the
fact that the Indian programme was modified to include iron supplementation.
Concerns also arise over the length of the evaluation period. To the extent that the evalua-
tion period coincides with the project period, any impacts that continue after the completion
of the project or only materialize in the long run will fail to be captured in the evaluation. In short, there may be significant lags in outcome responses to an intervention. In the case of health care programmes, for example, the effects of an intervention can only be detected once improvements in health outcomes (BMI, height-weight ratios, incidence of absenteeism, etc.) become measurable. Thus the required length of the evaluation period hinges on what the outcome variable of concern is, and whether sufficient time has elapsed for the intervention to change that variable. One
solution to this concern is to design an intervention to include the tracking of participants for
a significant period of time, perhaps even after the programme or intervention has ended. Of