practical element. However, the problem with such an approach is the impact of task-
student interaction: put simply, does that particular practical task suit the student? The
work of Rich Shavelson, Bob Linn and others (Linn & Baker, 1996; Shavelson, Ruiz-
Primo, & Wiley, 1999) shows that not until the student attempts six or more such tasks
does the average score across the tasks provide adequately reliable indications of the
student’s capability. We could therefore require each student to take at least six tasks
but this would be extremely time-consuming. More importantly, it is also unnecessary.
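The six-task figure can be illustrated with the Spearman-Brown prophecy formula, which gives the reliability of the mean of k parallel tasks from the single-task reliability. A minimal sketch, using a purely illustrative single-task reliability (the value 0.3 is an assumption, not a figure from the studies cited):

```python
# Spearman-Brown prophecy formula: reliability of the mean of k parallel
# tasks, given the reliability r1 of a single task.
def spearman_brown(r1, k):
    return k * r1 / (1 + (k - 1) * r1)

r1 = 0.3  # hypothetical single-task reliability, chosen for illustration
for k in (1, 2, 4, 6, 8):
    print(k, round(spearman_brown(r1, k), 2))
```

With a single-task reliability of 0.3, the mean of six tasks reaches about 0.72, while one or two tasks remain well below conventional thresholds, which is consistent with the claim that several tasks are needed before the average becomes adequately reliable.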
The purpose of the external assessment is to ensure that there is no advantage to the
teacher or the students of teaching only some of the curriculum, or teaching to only
some of the students. However, we can get the same assurance by light sampling, since
the teacher will not know which students will be tested on which parts of the
curriculum. Now of course the score that a particular student gets will not be a reliable
indicator of their achievement because of the task-student interaction described above,
but the average score of the class across the particular tasks taken will be an accurate
estimate of the average score of the class across all possible tasks (and one whose
accuracy we can judge precisely).
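The statistical point here can be sketched in a short simulation. All quantities below (class size, bank size, score model) are hypothetical choices for illustration: each student's score on a task is modelled as an ability component plus a task-student interaction component, each student is randomly allocated a single task, and the resulting class mean is compared with the class mean over all possible tasks:

```python
import random
import statistics

random.seed(1)

N_STUDENTS = 30   # hypothetical class size
N_TASKS = 12      # hypothetical bank of practical tasks

# Simulated scores: student ability plus task-student interaction noise.
ability = [random.gauss(50, 10) for _ in range(N_STUDENTS)]
score = [[ability[s] + random.gauss(0, 8) for _ in range(N_TASKS)]
         for s in range(N_STUDENTS)]

# Class mean over ALL possible tasks: the quantity we want to estimate.
true_class_mean = statistics.mean(
    score[s][t] for s in range(N_STUDENTS) for t in range(N_TASKS))

def light_sample():
    """Class mean when each student is allocated one task at random."""
    return statistics.mean(row[random.randrange(N_TASKS)] for row in score)

estimates = [light_sample() for _ in range(2000)]
print(round(true_class_mean, 2))
print(round(statistics.mean(estimates), 2))   # close to the true class mean
print(round(statistics.stdev(estimates), 2))  # sampling error we can quantify
```

Any one allocation gives unreliable individual scores, but the class mean is an unbiased estimate of the mean over all tasks, and the spread of the estimates shows the precision that can, as noted above, be judged exactly.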
This does raise problems of comparability as Newton notes, but the nature of
comparability raised by a light sampling approach is not the same as that in which the
test is to be used to impute scores to individuals. The classic definition of test
equivalence is that two tests are equivalent (i.e. comparable) if it is a matter of
indifference to the candidate which test is taken (Lord & Novick, 1968). For a light
sampling scheme, assessments would be comparable to the extent that it was a matter of
indifference which particular allocation of tasks or tests to students was actually
administered. Of course there will be particular allocations where, by chance, each
student in a class is allocated the particular task or test that suits them best, but these
will be extremely rare. With such a scheme there would be no requirement for each task
or test to be strictly comparable to the others, in the same way that two tests can be
equivalent without an item-by-item equivalence. The reliability of the system would, of
course, have to be investigated, but this could easily be undertaken by the allocation of
different sampling schemes to the same classes.
The marking of these tasks would, as Newton notes, be more complex than current
practice. It would not make sense to have one marker marking all of a school’s tasks
and tests because they would need to become familiar with the marking scheme for
every task and test. It would be much more sensible to send all the responses for a
particular task or test, from many schools, to one marker. While this sounds
administratively complex, it could be accomplished relatively straightforwardly with
the use of bar-coding.
Further research would be needed to determine the number of tasks and tests required,
but, as Newton states, the bank would need to be very large to allow the re-use of
tasks and tests from year to year. In cases where factors affecting the difficulty of items
were well understood, item-shells might be used with computers to generate large sets
of items of similar difficulty, and most, if not all, of the tasks and tests could probably
be administered by computer within the next few years. Ultimately, even the marking of
open-ended items may be possible by computer. However, Newton is right to sound a
note of caution regarding the availability of people to design the tasks and tests, and it would
be several years before a large enough bank of tasks and tests could be built up.