After all, even if competence in algebra were found to be independent of competence in
geometry, we could not conclude that competence in reading is independent of
competence in writing.

The research that is needed is actually quite straightforward to undertake. My
conjecture is that the levels of performance in untested aspects of the national
curriculum in each subject should decline, while levels in the tested aspects should stay
the same or rise, as has been found elsewhere (Linn, 1994).

The reliability of national curriculum assessments

The main thrust of my arguments with regard to the reliability of national curriculum
assessments (and GCSE and A-level examinations) is that data on the reliability of
state-mandated assessments should be made available routinely, and that such data
should be presented in a form that reflects how the results of the assessments are
actually used. In this context, it is worth noting that in the 1970s the examination boards
were happy to admit that A-level grades were accurate, at best, to within one grade either way.

In respect of the latter point, I have argued that traditional definitions of reliability, as a
form of ‘signal-to-noise’ ratio designed for continuous variables, create an unwarranted
sense of security when used to describe assessments that are reported on discrete scales
that are used to support dichotomous decisions. To illustrate this, table 1 (taken from
Wiliam, 2000) shows how the reliability of an assessment system looks very different
when presented as a classical reliability coefficient and in the form of the number of
students getting their ‘correct’ grades (in the sense of the grade corresponding to their
true score) when outcomes are reported on an 8-grade scale.

Reliability         .60    .70    .80    .90    .95    .99
Grading accuracy    40%    45%    52%    65%    75%    90%

Table 1: impact of reliability of marking on accuracy of grading for an 8-grade scale
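
The relationship in Table 1 can be explored with a small simulation. The following is a minimal sketch rather than the procedure used in Wiliam (2000): it assumes normally distributed true scores and errors, and grade boundaries that place equal shares of the observed scores in each grade. The function name `grading_accuracy` and all of its parameters are illustrative, so the figures it produces need not match those in the table.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def grading_accuracy(reliability, n_grades=8, n_students=100_000, seed=0):
    """Estimate the share of students whose observed grade equals the
    grade corresponding to their true score."""
    rng = random.Random(seed)

    # Classical test theory: reliability = var(true) / (var(true) + var(error)).
    # Fixing var(true) = 1 gives var(error) = (1 - reliability) / reliability.
    error_sd = ((1 - reliability) / reliability) ** 0.5
    observed_sd = (1 + error_sd ** 2) ** 0.5

    # Assumption: boundaries are set on the mark scale so that each grade
    # holds an equal share of the observed scores. The same boundaries are
    # then applied to true scores to find each student's 'correct' grade.
    observed_dist = NormalDist(0, observed_sd)
    cuts = [observed_dist.inv_cdf(k / n_grades) for k in range(1, n_grades)]

    correct = 0
    for _ in range(n_students):
        true_score = rng.gauss(0, 1)
        observed = true_score + rng.gauss(0, error_sd)
        if bisect_right(cuts, true_score) == bisect_right(cuts, observed):
            correct += 1
    return correct / n_students


if __name__ == "__main__":
    for r in (0.60, 0.70, 0.80, 0.90, 0.95, 0.99):
        print(f"reliability {r:.2f}: grading accuracy {grading_accuracy(r):.0%}")
```

Under these assumptions the simulated accuracy rises with reliability, but other boundary placements or score distributions would give different values.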

Newton regards ‘misclassification’ as a “highly problematic concept”, presumably
because he regards ‘classification’ as equally problematic. However, as long as we
accept the notion that, for a given assessment, a particular student will have a ‘true
score’ (defined as the long-run average of the scores on repeated takings of the same or
parallel tests without learning in between), then a student will have a true level or grade.
For students whose true score is close to a level boundary, even a highly reliable test
(i.e. one that yields fairly consistent scores for an individual) will sometimes award
a level other than their true level. Of course, as Newton suggests, the fact that someone
gets 26 marks as opposed to 27 marks doesn’t mean much in itself. But if this means
that they get a level 3 rather than a level 4 at the end of key stage 2, it is serious: the
student may be punished by parents, expectations of the student may be revised
downwards, and the student is regarded as in need of remediation in their secondary
school, given ‘booster’ classes and required to repeat the end-of-year-6 test at the end of
year 7. Perhaps, with a better understanding of errors of measurement, things would be
better, but as long as marks on tests are used to make dichotomous decisions, I
maintain that our measure of reliability should be the accuracy of the decisions.
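
To make the boundary effect concrete, the sketch below estimates the chance that a student's observed level matches their true level. It assumes normally distributed measurement error with a known standard error of measurement. The function `prob_true_level`, the level boundaries and the standard error are all hypothetical values chosen for illustration, not figures from the key stage 2 tests.

```python
from statistics import NormalDist


def prob_true_level(true_score, lower_cut, upper_cut, sem):
    """Probability that an observed score falls between lower_cut and upper_cut.

    Assumes observed = true_score + error, with error ~ Normal(0, sem).
    """
    observed = NormalDist(mu=true_score, sigma=sem)
    return observed.cdf(upper_cut) - observed.cdf(lower_cut)


if __name__ == "__main__":
    # Hypothetical level 4 band running from 27 to 40 marks, with a
    # hypothetical standard error of measurement of 3 marks.
    for true_score in (27, 30, 33):
        p = prob_true_level(true_score, lower_cut=27, upper_cut=40, sem=3)
        print(f"true score {true_score}: P(awarded level 4) = {p:.2f}")
```

A student whose true score sits exactly on the lower boundary is awarded the true level only about half the time under this model, however small the standard error, which is why decision accuracy and the reliability coefficient can tell such different stories.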

In a final comment on this issue, Newton suggests that the use of tasks or tests might
well result in lower reliability (for the scores for individuals) than with the existing tests.
This is absolutely right, but it doesn’t matter: these scores are reported and
used only at the group level, where the reliability is close to 100%. Newton points out
that the same would be true if the existing tests were reported only at whole-class or

National curriculum assessment: how to make it better
