student performance measure. With this approach, the average score of a particular country
will depend on the achievement of the students in the other participating countries. Thus, the
test scores for a particular country are not necessarily comparable over time, especially since the country sample changes. More importantly, for the tests prior to 1991, “Warm estimates” were not calculated, so we have to rely on the share of correct answers for these tests.7
To make the scores on the different international tests comparable on a common metric, we
have re-scaled the average scores for each international test using the following procedure. First, we calculate the average of the Mathematics and Science tests when both subjects are tested.
Second, we standardize the average score for each test to have mean zero and standard
deviation equal to unity for a “core” group of 15 countries. The “core” is defined as the
countries that have participated in at least six out of the eight international tests reported in
Table 1, namely Australia, Canada, Hong Kong, Hungary, Israel, Italy, Japan, Korea,
Netherlands, New Zealand, Russia, Sweden, Thailand, UK, and USA.8 Third, we re-scale the
scores for each of the other countries using the same parameters as for the “core” countries.
Finally, since many countries took two tests in 2003 (TIMSS and PISA), we use the average of those two tests for the 2003 observation.
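In compact notation, introduced here purely for illustration, the procedure can be summarized as

\[
  s_{it} = \tfrac{1}{2}\left(M_{it} + S_{it}\right), \qquad
  z_{it} = \frac{s_{it} - \mu_t}{\sigma_t}, \qquad
  z_{i,2003} = \tfrac{1}{2}\left(z_{i,\mathrm{TIMSS}} + z_{i,\mathrm{PISA}}\right),
\]

where $M_{it}$ and $S_{it}$ are country $i$'s Mathematics and Science scores on test $t$, $\mu_t$ and $\sigma_t$ are the mean and standard deviation of $s_{it}$ over the “core” countries participating in test $t$, the same $\mu_t$ and $\sigma_t$ are applied to the non-“core” countries, and the last expression averages the two re-scaled scores for countries with both 2003 tests.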
Making the results from different tests comparable across time has also been a challenge for
previous empirical studies. For example, Hanushek and Kimko (2000) calculate a measure of
labor-force quality based on the percent of correct answers in international student
achievement tests for the period 1965-1991. They adjust the mean for each test, but not the
variance (except the linear scaling that follows from the adjustment of the mean). Adjusting
the means across tests is crucial in their analysis because they subsequently calculate an
aggregated 30-year average quality measure for each country. More recently, Hanushek and
Wossmann (2007) utilize, for their cross-section of national student performance, tests from TIMSS, PISA and the IEA up to 2003 and, in addition to adjusting the means, they correct the dispersion of each single test in a way similar to ours.9
Figure 1a shows that the density of our measure of student achievement across the observations for the 15 “core” countries is close to the normal distribution. The density for all observations
7 We have compared the Warm estimates and the percent of correct answers for the IEA tests in 1994-95 and 1998-99, for which both measures are available. The correlation coefficients for Mathematics are 0.997 and 0.982, respectively, and for Science 0.994 and 0.977, respectively. Thus, the differences across countries do not seem to be influenced in any important way by the choice of scale.
8 More precisely, we standardize the score for those of the “core” countries that participated in the particular test. Out of the 15 “core” countries used to standardize the test scores, the data sources report results for 11 countries in 1980-81, 12 in 1983-84, 8 in 1990-91, 15 in 1994-95, 14 in 1998-99, 15 in OECD 2000, 13 in TIMSS 2003 and 13 in OECD 2003. Only the USA has test scores for all tests.