MACHINE LEARNING
270
irrelevant features. Examples of this use of background
knowledge will be presented later in the paper.
STRATEGIES FOR EXPLOITING CONTEXT
Katz et al. [6] list four strategies for using contextual
information when classifying:
1. Contextual normalization: The contextual features
can be used to normalize the context-sensitive pri-
mary features, prior to classification. The intent is to
process context-sensitive features in a way that
reduces their sensitivity to the context.
2. Contextual expansion: A feature space composed of
primary features can be expanded with contextual fea-
tures. The contextual features can be treated by the
classifier in the same manner as the primary features.
3. Contextual classifier selection: Classification can
proceed in two steps: First select a specialized classi-
fier from a set of classifiers, based on the contextual
features. Then apply the specialized classifier to the
primary features.
4. Contextual classification adjustment: The two steps
in strategy 3 can be reversed: First classify, using only
the primary features. Then make an adjustment to the
classification, based on the contextual features.
This paper examines strategies 1 and 2. A fifth strategy is
also investigated:
5. Contextual weighting: The contextual features can
be used to weight the primary features, prior to classi-
fication. The intent of weighting is to assign more
importance to features that, in a given context, are
more useful for classification.
The purpose of contextual normalization is to treat all
features equally, by removing the affects of context and
measurement scale. Contextual weighting has a different
purpose: to prefer some features over other features, if
they may improve accuracy.
THE CLASSIFICATION ALGORITHMS
To demonstrate the generality of the above strategies,
three different classification algorithms were used, a form
of instance-based learning (IBL) [1, 2, 3], multivariate
linear regression (MLR) [4], and cascade-correlation (CC)
[5].
Instance-based learning [1, 2] is closely related to the
nearest neighbor pattern recognition paradigm [3]. Predic-
tions are made by matching new data to stored data, using
a measure of similarity to find the best matches [3]. The
algorithm used here is a simple form of IBL, known as
single-nearest neighbor pattern recognition. The algorithm
is given, as input, an observation (a feature vector) in the
testing set. To classify this observation, the algorithm
simply looks for the most similar observation in the
training set (the single nearest neighbor). The output of the
algorithm is the class of the nearest neighbor in the
training set. The measure of similarity used here is based
on the sum of the absolute values of the differences
between the elements of the vectors. If x and y are two
feature vectors, then the similarity between x and y is
defined as:
∑(1 — X—yi∣) (9)
і
In this paper, IBL will be used to refer to this simple form
of instance-based learning. IBL easily handles both
symbolic [1] and real-valued [2] features and classes.
Multivariate linear regression [4] models data with a
system of linear equations. Like IBL, MLR easily handles
both symbolic and real-valued features and classes. The
algorithm used here is a form of MLR that is suitable for
symbolic classes, known as linear discriminant analysis.
In this paper, MLR will be used to refer to this form of
multivariate linear regression. Suppose that there are n
distinct classes. In the training phase, MLR generates n
linear equations, one for each of the n classes. The general
form of the n linear equations is:
У = ∑ aixi (10)
і
For example, consider the linear equation for one of the n
classes, class j. In the training phase, for each observation
in the training set, y is set to the value 1 when the obser-
vation belongs to class j. Otherwise, y is set to 0. The xi
in the equation for class j are selected from among the
features available in the feature space. MLR uses the
forward selection procedure to select the xi [4]. Standard
linear regression techniques are used to find the best
values for the constant coefficients ai in the linear
equation [7]. In the testing phase, MLR is given the values
of the xi variables for each observation in the testing set.
To predict the class of an observation, MLR calculates the
values of the n linear equations. This yields n values for
y, one value for each of the n classes. The predicted class
of the observation is the class with the largest calculated y
value.
Cascade-correlation (CC) [5] is a form of neural
network algorithm. Like IBL and MLR, CC easily handles
both symbolic and real-valued features and classes. The
CC algorithm is similar to feed-forward neural networks
trained with back-propagation. An interesting characteris-
tic of the CC algorithm is that the network architecture is