kept primarily for formative purposes, to the summative level that would be reported to
students and their parents. The teacher would be free (indeed would be expected) to
discount evidence related to ‘trick’ questions like the one given above, when arriving at
a level.
The overall profile of levels for a class would be ‘moderated’ by the external tasks and
tests (see below), which would ensure that the levels awarded could not be inflated by
the teacher. There are many ways in which this could be done—the most severe would
be to use the results of the external tasks and tests to define an ‘envelope’ of levels that
the teacher was allowed to award, so that the distribution of the summative levels
given by the teacher would have to be exactly the same as that for the external
tasks and tests. In addition, in order to check that the teacher’s weighting of various
aspects of the domain was similar to that intended in the curriculum,
requirements for correlation could be imposed, so that, to some extent at least, those
getting high marks on the tasks and tests would be awarded high levels. However, this
would be a crude measure, and there is no doubt that additional ways would be needed
to detect and, where possible, eliminate the forms of bias noted by Newton (eg over-
emphasis on certain aspects of the domain, and inclusion of construct irrelevant
variance, such as halo effects).
As I note in Wiliam (2000a), care must also be taken to avoid the teacher’s role in
summative assessment driving formative evidence underground (eg when students do
not divulge difficulties to the teacher because they believe it will be ‘held against
them’). Ultimately, this can only be resolved through trust, but it can be ameliorated
through the depersonalisation of the assessment procedure—while the assessment of the
student against the criteria is undertaken by the teacher, it is important that the student
understands that the criteria themselves are not determined by the teacher, but are
external. Although not perfect, the teacher could then still claim to be the student’s ally.
Newton also raises the question of whether teachers’ assessments would, as I have
claimed, be more reliable than those arising from tests. He is right to point out that
continuous assessment over the period of the key stage is not a replication of the final
assessment, and it would, indeed, be invidious if a student’s level were reduced by the
teacher because the last recorded evidence of a particular aspect of the domain dated
from the previous year. However, if we adopt the conceptual framework provided by
generalisability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972), then it is a
logical necessity that the degree of unreliability contributed by student-task interactions
will be lower than for a traditional test because there are more tasks, provided, of
course, teachers can apply the correct standards, and base their levels on the ‘latest and
best’ evidence. If we can control the other sources of unreliability (task-rater
interactions, student-rater interactions, etc—see below) then teachers’ assessments will
be more reliable than tests.
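The generalisability-theory argument can be made concrete with the standard coefficient in which the student-task interaction variance is divided by the number of tasks. The variance components below are invented, not drawn from any study; the point is only the direction of the effect.

```python
# Illustrative arithmetic: with student-task interaction as the only
# error source (all other sources assumed controlled, as the text
# stipulates), the generalisability coefficient rises as the number of
# tasks grows, because the interaction variance is divided by n_tasks.
# The variance components are invented for the example.

def g_coefficient(var_student, var_interaction, n_tasks):
    """Generalisability coefficient with student-task interaction as
    the sole error source."""
    return var_student / (var_student + var_interaction / n_tasks)

var_student = 1.0       # 'true' between-student variance
var_interaction = 2.0   # student-task interaction variance

# A short test samples few tasks; continuous teacher assessment over a
# key stage samples many.
test_reliability = g_coefficient(var_student, var_interaction, n_tasks=4)
teacher_reliability = g_coefficient(var_student, var_interaction, n_tasks=40)
```

The advantage is a logical necessity only under the stated provisos: if teachers misapply standards or other sources of error (task-rater, student-rater interactions) are left uncontrolled, those terms re-enter the denominator and the advantage can disappear.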
Evaluative assessments
The fundamental feature of the assessment system I propose is that the evaluative
function of the assessment is based on light sampling. The logic of this is
straightforward. In order to avoid the possibility of ‘teaching to the test’ we need to
assess a greater proportion of the domain of interest. More precisely, we want to create a
situation in which we are happy for teachers to teach to the test, because the only way to
improve the test score is to improve the performance of students on the whole domain.
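One way to read the light-sampling logic is as matrix sampling: each individual student attempts only a few tasks, but the tasks are rotated across the class so that, in aggregate, the whole domain is assessed. The domain size, class size, and tasks per student below are invented for illustration.

```python
# Sketch of light (matrix) sampling: no student sees more than a small
# fraction of the domain, yet the class as a whole covers all of it, so
# the only way to raise aggregate scores is to teach the whole domain.
# All sizes are invented for the example.

DOMAIN = list(range(20))   # 20 assessable aspects of the domain
CLASS_SIZE = 30
TASKS_PER_STUDENT = 4      # each student attempts only 20% of the domain

def assign_tasks(domain, class_size, tasks_per_student):
    """Deterministically rotate tasks through the class so every aspect
    of the domain is sampled by some student."""
    assignments = []
    for student in range(class_size):
        start = (student * tasks_per_student) % len(domain)
        tasks = [domain[(start + k) % len(domain)]
                 for k in range(tasks_per_student)]
        assignments.append(tasks)
    return assignments

assignments = assign_tasks(DOMAIN, CLASS_SIZE, TASKS_PER_STUDENT)
covered = set(t for tasks in assignments for t in tasks)
```

Under such a scheme 'teaching to the test' is harmless, since no teacher knows which light sample any given student will face, and the aggregate result reflects the whole domain.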
Newton seems to suggest that this could be achieved by adding ‘authentic’ elements to
the existing assessment, as happens in some science examinations which involve a