National curriculum assessment: how to make it better



After all, if it were found that competence in algebra was independent of competence in
geometry, this would not mean that we could conclude that competence in reading is
independent of competence in writing.

The research that is needed is actually quite straightforward to undertake—my
conjecture is that the levels of performance in untested aspects of the national
curriculum in each subject should decline while levels in the tested aspects should stay
the same, or rise, as has been found elsewhere (Linn, 1994).

The reliability of national curriculum assessments

The main thrust of my arguments with regard to the reliability of national curriculum
assessments (and GCSE and A-level examinations) is that data on the reliability of
state-mandated assessments should be made available routinely, and that such data
should be presented in a form that reflects how the results of the assessments are
actually used. In this context, it is worth noting that in the 1970s the examination boards
were happy to admit that A-level grades were accurate to at most one grade either way.

In respect of the latter point, I have argued that traditional definitions of reliability, as a
form of ‘signal-to-noise’ ratio designed for continuous variables, create an unwarranted
sense of security when used to describe assessments that are reported on discrete scales
that are used to support dichotomous decisions. To illustrate this, table 1 (taken from
Wiliam, 2000) shows how the reliability of an assessment system looks very different
when presented as a classical reliability coefficient and in the form of the number of
students getting their ‘correct’ grades (in the sense of the grade corresponding to their
true score) when outcomes are reported on an 8-grade scale.

Reliability         .60    .70    .80    .90    .95    .99
Grading accuracy    40%    45%    52%    65%    75%    90%

Table 1: impact of reliability of marking on accuracy of grading for an 8-grade scale
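
The relationship between a reliability coefficient and grading accuracy can be explored with a small simulation. What follows is a minimal sketch and not the method used to produce table 1. It assumes normally distributed true scores and errors, and grade boundaries that divide the observed-score distribution into eight equally populated bands. The function name `grading_accuracy` and all of its parameters are illustrative assumptions. The percentages it prints should not be expected to match those in the table, since the table's underlying assumptions are not stated here.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def grading_accuracy(reliability, n_grades=8, n_students=20000, seed=0):
    """Estimate the share of simulated students awarded their 'true' grade."""
    rng = random.Random(seed)
    # Assumption: boundaries split the observed-score distribution
    # (standard normal here) into n_grades equally populated bands.
    cuts = [NormalDist().inv_cdf(k / n_grades) for k in range(1, n_grades)]
    # Observed variance is fixed at 1, so the true-score variance equals
    # the reliability and the error variance equals 1 - reliability.
    true_sd = reliability ** 0.5
    error_sd = (1.0 - reliability) ** 0.5
    correct = 0
    for _ in range(n_students):
        true_score = rng.gauss(0.0, true_sd)
        observed = true_score + rng.gauss(0.0, error_sd)
        # The same boundaries define both the true and the awarded grade.
        if bisect_right(cuts, true_score) == bisect_right(cuts, observed):
            correct += 1
    return correct / n_students


if __name__ == "__main__":
    for r in (0.60, 0.70, 0.80, 0.90, 0.95, 0.99):
        print(f"reliability {r:.2f}: {grading_accuracy(r):.0%} correctly graded")
```

Under these assumptions the proportion of correctly graded students rises with reliability but stays well below the coefficient itself, which is the qualitative pattern the table illustrates.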

Newton regards ‘misclassification’ as a “highly problematic concept”, presumably
because he regards ‘classification’ as equally problematic. However, as long as we
accept the notion that, for a given assessment, a particular student will have a ‘true
score’ (defined as the long-run average of the scores on repeated takings of the same or
parallel tests without learning in between), then a student will have a true level or grade.
For students whose true score is close to a level boundary, even a highly reliable test
(i.e. one that yields fairly consistent scores for an individual) will sometimes award
a level other than their true level. Of course, as Newton suggests, the fact that someone
gets 26 marks as opposed to 27 marks doesn’t mean much in itself, but, if this means
that they get a level 3 rather than a level 4 at the end of key stage 2, it is serious: the
student may be punished by parents, expectations of the student may be revised
downwards, and the student may be regarded as in need of remediation in their secondary
school, given ‘booster’ classes and required to repeat the end-of-year 6 test at the end of
year 7. Perhaps, with a better understanding of errors of measurement, things would be
better, but as long as marks on tests are used to make dichotomous decisions, then I
maintain that our measure of reliability should be the accuracy of the decisions.
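
The point about students near a level boundary can be made concrete with a short calculation. This is a sketch under stated assumptions: it uses the classical standard error of measurement (score standard deviation multiplied by the square root of one minus the reliability) and treats errors as normally distributed. The function name `prob_below_boundary` and every numerical value, including the boundary of 27 marks and the standard deviation of 10 marks, are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def prob_below_boundary(true_score, boundary, score_sd, reliability):
    """Chance that an observed score falls below a level boundary."""
    # Standard error of measurement in classical test theory.
    sem = score_sd * (1.0 - reliability) ** 0.5
    # Assumption: observed scores are normal around the true score.
    return NormalDist(mu=true_score, sigma=sem).cdf(boundary)


# Hypothetical values: true score of 28 marks, level boundary at 27 marks,
# observed scores with a standard deviation of 10 marks, reliability 0.90.
p = prob_below_boundary(true_score=28, boundary=27, score_sd=10, reliability=0.90)
print(f"chance of being awarded the lower level: {p:.2f}")
```

With these illustrative numbers, a student whose true score sits one mark above the boundary would be awarded the lower level on roughly 38% of occasions, even though a reliability of 0.90 would usually be regarded as high.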

In a final comment on this issue, Newton suggests that the use of tasks or tests might
well result in lower reliability (for the scores for individuals) than with the existing tests.
This is absolutely right, but it doesn’t matter because these scores are reported and
used only at the group level, so that the reliability of the group-level results is close to 100%. Newton points out
that the same would be true if the existing tests were reported only at whole-class or


