patients in each interval. The values of μ_i(c) and σ_i(c)
were estimated by taking the average and standard
deviation of x_i for each interval c. This is different from
the method used for contextual normalization with the
continuous contextual features in gas turbine engine
diagnosis [7]. Note that equation (11) does not require
continuous features; it works well with the boolean
features in the hepatitis data, when true and false are
represented by one and zero.
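The calculation can be sketched as follows. This is a minimal sketch in Python, assuming equation (11) takes the familiar form (x_i - μ_i(c)) / σ_i(c); the age intervals, feature values, and function name are invented for illustration.

import numpy as np

def contextual_normalize(x, age, bin_edges):
    # Normalize one feature column x using per-age-interval statistics.
    # x         : 1-D array of feature values (boolean features coded as 0/1)
    # age       : 1-D array of patient ages (the contextual feature)
    # bin_edges : edges that define the age intervals c
    c = np.digitize(age, bin_edges)      # assign each patient to an interval
    z = np.empty_like(x, dtype=float)
    for interval in np.unique(c):
        mask = (c == interval)
        mu = x[mask].mean()              # mu_i(c): average of x_i in interval c
        sigma = x[mask].std()            # sigma_i(c): standard deviation in interval c
        z[mask] = (x[mask] - mu) / (sigma if sigma > 0 else 1.0)
    return z

# Example: a boolean feature normalized within three invented age intervals.
age = np.array([25, 30, 42, 47, 60, 65])
x = np.array([1, 0, 1, 1, 0, 0], dtype=float)
print(contextual_normalize(x, age, bin_edges=[40, 55]))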
Contextual expansion: The age of the patient was treated
as another feature. This strategy is not useful for the
patient’s sex, since so few patients are female.
Contextual weighting: The features were multiplied by
weights, where the weight for a feature was the ratio of
inter-class deviation to intra-class deviation, as in equation
(12). The inter-class deviation and the intra-class deviation
were calculated using the five age intervals.
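The weighting can be sketched as follows. This is a minimal sketch that assumes the inter-class deviation is the standard deviation of the per-class means of a feature and the intra-class deviation is the average within-class standard deviation, each computed within an age interval and then averaged across the intervals; equation (12) is not reproduced in this section, so these details are assumptions rather than a transcription.

import numpy as np

def contextual_weight(z, y, c):
    # Weight for one (already normalized) feature z, in the spirit of
    # equation (12): the ratio of inter-class deviation to intra-class
    # deviation, estimated over the age intervals c.
    ratios = []
    for interval in np.unique(c):
        mask = (c == interval)
        zi, yi = z[mask], y[mask]
        classes = np.unique(yi)
        class_means = np.array([zi[yi == k].mean() for k in classes])
        class_stds = np.array([zi[yi == k].std() for k in classes])
        inter = class_means.std()        # inter-class deviation in this interval
        intra = class_stds.mean()        # intra-class deviation in this interval
        if intra > 0:
            ratios.append(inter / intra)
    return np.mean(ratios)

# The weighted feature is then w_j * z_j for every instance; the weight
# itself does not depend on the instance's context.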
Table 5 shows the results of using different combinations
of the three strategies (contextual normalization,
contextual expansion, and contextual weighting) with
IBL. As in the previous section, there is a form of synergy
here, since the sum of the improvements of each strategy
used separately is less than the improvement of the three
strategies together ((71 - 71) + (71 - 71) +
(83 - 71) = 12% for the sum of the three strategies
versus 84 - 71 = 13% for the three strategies used
together). In this case, however, the synergy is not as
marked as it is in the previous section. This may be due to
the fact that there is no systematic difference between the
training and testing sets in the hepatitis data, while the
testing set for the vowel data uses different speakers from
the training set.
For comparison, other researchers have reported
accuracies of 80% [11] and 83% [12] on the hepatitis data.
It is interesting that a single-nearest neighbor algorithm
can match or surpass these results when strategies are
employed to use the contextual information contained in
the data.
DISCUSSION OF RESULTS
The results reported above indicate that contextual
normalization and contextual weighting can significantly
improve the accuracy of classification. Contextual
expansion is less effective than contextual normalization
and contextual weighting, although it appears useful when
used in conjunction with the other two techniques.
Equation (11) (a form of contextual normalization)
has three characteristics:
1. The normalized features all have the same scale, so
we can directly compare features that were originally
measured with different scales.
2. Equation (11) tends to weight features according to
their relevance for classification. Features that are far
from average, in a given context, are normalized to
values that are far from zero. That is, a surprising
feature will get a high absolute value. A feature that is
irrelevant will tend to have a high variation, so it will
tend to be normalized to a value near zero. A feature
that is near average will also be normalized to a value
near zero. Note that this holds for boolean features as
well as continuous features (a small worked example
follows this list).
3. Equation (11) compensates for variations in a feature
that are due to variations in the context. Thus it
reduces the impact of the context, allowing the
classification system to generalize across different contexts
more easily.
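To illustrate the second characteristic with a small worked example (the numbers are invented, and equation (11) is again assumed to have the form (x_i - μ_i(c)) / σ_i(c)): suppose a boolean feature is true for only 5% of the patients in some age interval, so μ_i(c) = 0.05 and σ_i(c) ≈ 0.22. An observed value of one is then normalized to about (1 - 0.05) / 0.22 ≈ 4.3, far from zero. A boolean feature that is true for roughly half the patients in the interval has μ_i(c) = 0.5 and σ_i(c) = 0.5, so its values are normalized to ±1, closer to zero; and any value near the contextual average is normalized to a value near zero by construction.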
Equation (11) is only one possible form of contextual
normalization. For example, another form of contextual
normalization could use a context-sensitive estimate of the
minimum and maximum values to normalize a feature.
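A minimal sketch of that alternative, in the same style as the earlier sketch (the min/max variant is only suggested here; it was not used in the experiments):

import numpy as np

def minmax_contextual_normalize(x, c):
    # Alternative contextual normalization: rescale a feature to [0, 1]
    # using a context-sensitive minimum and maximum, estimated per interval c.
    z = np.zeros_like(x, dtype=float)
    for interval in np.unique(c):
        mask = (c == interval)
        lo, hi = x[mask].min(), x[mask].max()
        if hi > lo:
            z[mask] = (x[mask] - lo) / (hi - lo)
        else:
            z[mask] = 0.0
    return z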
Contextual weighting is a new technique for using
contextual information. The idea of contextual weighting
is to assign more weight to the features that seem more
useful for classification, in a given context. Equation (12)
is only one possible form of contextual weighting. For
example, another form of contextual weighting might vary
the weight as a function of the context. With equation (12),
the weight is calculated using contextual information, but
the weight does not change as a function of the context.
Note that equation (11) is a linear transformation of
the data when the context c is constant, but it is a
nonlinear transformation when the context is variable.
Equation (12) is a linear transformation of the data, both
when the context c is constant and when it is variable,
since the weight w_j is fixed; it does not vary with the
context c.
Of the three classification algorithms, IBL gained the
most from contextual normalization and contextual
weighting. The form of IBL that was used here (single-
nearest neighbor with sum of absolute values as a distance
measure) is particularly sensitive to the scales of the
features. If one feature ranges from 0 to 100 and the
remaining features range from 0 to 1, then the first feature
will have much more influence on the distance measure
than the remaining features. Therefore IBL can benefit
significantly from contextual normalization, which attempts
to equalize scales. MLR and CC are designed to be
unaffected by linear transformations of the features. Therefore
they do not favor features with larger ranges. However,
this strength is also a weakness, because MLR and CC
cannot benefit from preprocessing of the data that
increases the scale of more significant variables. For
example, contextual weighting (using equation (12)) has
no effect on MLR and it has only minor effects on CC.
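To make that scale sensitivity concrete, here is a minimal single-nearest-neighbor classifier with the sum-of-absolute-differences distance; the two stored cases and the query are invented for illustration.

import numpy as np

def nn1_classify(x_train, y_train, x_query):
    # Single nearest neighbor with the sum of absolute differences (L1) distance.
    d = np.abs(x_train - x_query).sum(axis=1)   # distance to each stored case
    return y_train[np.argmin(d)]

# Invented toy data: the first feature ranges over 0..100, the second over 0..1.
x_train = np.array([[90.0, 0.1],
                    [10.0, 0.9]])
y_train = np.array(['A', 'B'])

# The query matches case B exactly on the second (0..1) feature, but the
# first feature's larger range dominates the distance, so case A is returned.
print(nn1_classify(x_train, y_train, np.array([80.0, 0.9])))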