National curriculum assessment: how to make it better



practical element. However, the problem with such an approach is the impact of
task-student interaction: put simply, does that particular practical task suit the student? The
work of Rich Shavelson, Bob Linn and others (Linn & Baker, 1996; Shavelson, Ruiz-
Primo, & Wiley, 1999) shows that not until the student attempts six or more such tasks
does the average score across the tasks provide adequately reliable indications of the
student’s capability. We could therefore require each student to take at least six tasks
but this would be extremely time-consuming. More importantly, it is also unnecessary.
The purpose of the external assessment is to ensure that there is no advantage to the
teacher or the students of teaching only some of the curriculum, or teaching to only
some of the students. However, we can get the same assurance by light sampling, since
the teacher will not know which students will be tested on which parts of the
curriculum. Now of course the score that a particular student gets will not be a reliable
indicator of their achievement because of the task-student interaction described above,
but the average score of the class across the particular tasks taken will be an accurate
estimate of the average score of the class across all possible tasks (and one whose
accuracy we can judge precisely).
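
As a rough illustration of this argument, the following Python sketch simulates one class under an assumed score model: each score is the student's ability plus the task's easiness plus a random task-student interaction. The function name `simulate`, the class size, the number of tasks and all of the numerical parameters are hypothetical choices made for illustration, not figures taken from the studies cited. Each student is randomly allocated a single task, and the sketch compares how far a student's one-task score falls from that student's average over all tasks with how far the class average of the sampled scores falls from the class average over all tasks.

```python
import random
import statistics


def simulate(n_students=30, n_tasks=20, n_allocations=2000, seed=0):
    """Compare student-level and class-level error under light sampling.

    All parameters are illustrative assumptions, not empirical values.
    Returns (student_sd, class_sd): the spread of the error in a single
    student's one-task score, and the spread of the error in the class
    average of those one-task scores.
    """
    rng = random.Random(seed)

    # Assumed score model: ability + task easiness + task-student interaction.
    ability = [rng.gauss(50, 10) for _ in range(n_students)]
    easiness = [rng.gauss(0, 5) for _ in range(n_tasks)]
    scores = [
        [ability[s] + easiness[t] + rng.gauss(0, 10) for t in range(n_tasks)]
        for s in range(n_students)
    ]

    # "True" values: averages over every task in the (simulated) pool.
    student_true = [statistics.fmean(row) for row in scores]
    class_true = statistics.fmean(student_true)

    student_errors = []
    class_errors = []
    for _ in range(n_allocations):
        # Light sampling: each student takes one randomly allocated task.
        picks = [rng.randrange(n_tasks) for _ in range(n_students)]
        sampled = [scores[s][picks[s]] for s in range(n_students)]
        student_errors.extend(
            sampled[s] - student_true[s] for s in range(n_students)
        )
        class_errors.append(statistics.fmean(sampled) - class_true)

    return statistics.pstdev(student_errors), statistics.pstdev(class_errors)


if __name__ == "__main__":
    student_sd, class_sd = simulate()
    print(f"spread of error in one student's score: {student_sd:.2f}")
    print(f"spread of error in the class average:   {class_sd:.2f}")
```

Under these assumed parameters the class-level error should come out several times smaller than the student-level error, shrinking roughly with the square root of the class size. Repeating the random allocation many times, as the loop does, is also one simple way of judging how accurate the class estimate is.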

This does raise problems of comparability, as Newton notes, but the kind of
comparability at issue in a light sampling approach is not the same as that required when
the test is to be used to impute scores to individuals. The classic definition of test
equivalence is that two tests are equivalent (i.e., comparable) if it is a matter of
indifference to the candidate which test is taken (Lord & Novick, 1968). For a light
sampling scheme, assessments would be comparable to the extent that it was a matter of
indifference which particular allocation of tasks or tests to students was actually
administered. Of course there will be particular allocations where, by chance, each
student in a class is allocated the particular task or test that suits them best, but these
will be extremely rare. With such a scheme there would be no requirement for each task
or test to be strictly comparable to the others, in the same way that two tests can be
equivalent without an item-by-item equivalence. The reliability of the system would, of
course, have to be investigated, but this could easily be undertaken by the allocation of
different sampling schemes to the same classes.

The marking of these tasks would, as Newton notes, be more complex than current
practice. It would not make sense to have one marker marking all of a school’s tasks
and tests because they would need to become familiar with the marking scheme for
every task and test. It would be much more sensible to send all the responses for a
particular task or test, from many schools, to one marker. While this sounds
administratively complex, with the use of bar-coding, this could be accomplished
relatively straightforwardly.

The number of tasks and tests that would be required would need further research to
determine, but, as Newton states, it would need to be very large to allow the re-use of
tasks and tests from year to year. In cases where factors affecting the difficulty of items
were well understood, item-shells might be used with computers to generate large sets
of items of similar difficulty, and most, if not all, of the tasks and tests could probably
be administered by computer within the next few years. Ultimately, even the marking of
open-ended items may be possible by computer. However, Newton is right to sound a note of
caution regarding the availability of people to design the tasks and tests, and it would
be several years before a large enough bank of tasks and tests could be built up.
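
To make the idea of an item-shell concrete, here is a minimal Python sketch. It assumes a single hypothetical shell (a rectangle-area question) and assumes that restricting the numbers to one single-digit and one two-digit value keeps difficulty roughly constant. That assumption, the wording of the shell and the function name `generate_items` are invented for illustration; in practice the factors affecting difficulty would have to be established empirically.

```python
import random


def generate_items(n_items, seed=0):
    """Fill a hypothetical item-shell with values drawn under fixed constraints."""
    rng = random.Random(seed)
    shell = (
        "A rectangle is {w} cm wide and {h} cm long. "
        "What is its area in square centimetres?"
    )

    # Assumed difficulty constraint: one single-digit and one two-digit factor.
    pairs = [(w, h) for w in range(3, 10) for h in range(12, 20)]

    # rng.sample draws without replacement, so no item is repeated; it raises
    # ValueError if more items are requested than the shell can supply.
    chosen = rng.sample(pairs, n_items)
    return [
        {"prompt": shell.format(w=w, h=h), "answer": w * h}
        for w, h in chosen
    ]


if __name__ == "__main__":
    for item in generate_items(3):
        print(item["prompt"], "->", item["answer"])
```

Even this toy shell yields only 56 distinct items, which gives some sense of why a large and varied bank of shells would be needed before tasks could be re-used from year to year.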

Conclusion


