DEFINITION OF CONTEXT
This section presents a precise definition of context.
Let C be a finite set of classes. Let F be an n-dimensional
feature space. Let X = (x0, x1, ..., xn) be a member of
C × F; that is, (x1, ..., xn) ∈ F and x0 ∈ C. We will use
x to represent a variable and a = (a0, a1, ..., an) to
represent a constant in C × F. Let p be a probability
distribution defined on C × F. In the definitions that follow,
we will assume that p is a discrete distribution. It is easy
to extend these definitions to the continuous case.
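As a concrete illustration of this setup, the following sketch (ours, not part of the definitions) represents a discrete p on C × F as a Python dict mapping each tuple (a0, a1, ..., an) to its probability; the distribution p_example and the helper names marginal and conditional are hypothetical.

# A minimal sketch, assuming p is stored as a dict from points of
# C x F to probabilities. The distribution below is an arbitrary
# illustration with a binary class x0 and two binary features.
p_example = {
    (0, 0, 0): 0.25,
    (0, 0, 1): 0.25,
    (1, 1, 0): 0.25,
    (1, 1, 1): 0.25,
}

def marginal(p, fixed):
    """Probability that the positions in `fixed` ({index: value}) take
    the given values, summing p over all remaining positions."""
    return sum(prob for point, prob in p.items()
               if all(point[i] == v for i, v in fixed.items()))

def conditional(p, target, given):
    """p(target | given) for {index: value} dicts; None when the
    conditioning event has probability zero."""
    denom = marginal(p, given)
    if denom == 0:
        return None
    return marginal(p, {**target, **given}) / denom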
Primary Feature: Feature xi (where 1 ≤ i ≤ n) is a
primary feature for predicting the class x0 when there is a
value ai of xi and there is a value a0 of x0 such that:

p(x0 = a0 | xi = ai) ≠ p(x0 = a0)   (1)
In other words, the probability that x0 = a0, given
xi = ai, is different from the probability that x0 = a0.
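Condition (1) can be checked mechanically. The sketch below is illustrative: it reuses the marginal and conditional helpers from above, the names values and is_primary are ours, and a small tolerance stands in for exact inequality under floating-point arithmetic.

def values(p, i):
    """Values that position i takes with nonzero probability."""
    return {point[i] for point in p}

def is_primary(p, i):
    """Test condition (1): some pair (a0, ai) satisfies
    p(x0 = a0 | xi = ai) != p(x0 = a0)."""
    for a0 in values(p, 0):
        for ai in values(p, i):
            cond = conditional(p, {0: a0}, {i: ai})
            if cond is not None and abs(cond - marginal(p, {0: a0})) > 1e-12:
                return True
    return False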
Contextual Feature: Feature xi (where 1 ≤ i ≤ n) is a
contextual feature for predicting the class x0 when xi is
not a primary feature for predicting the class x0 and there
is a value a of x such that:

p(x0 = a0 | x1 = a1, ..., xn = an) ≠
p(x0 = a0 | x1 = a1, ..., xi-1 = ai-1, xi+1 = ai+1, ..., xn = an)   (2)
In other words, if xi is a contextual feature, then we can
make a better prediction when we know the value ai of xi
than we can make when the value is unknown, assuming
that we know the values of the other features,
x1, ..., xi-1, xi+1, ..., xn.
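In the same illustrative style, condition (2) can be tested by comparing the class probability conditioned on all features against the probability with xi dropped, for each point in the support of p (scanning only the support suffices: a difference at a zero-probability class value forces a compensating difference at some supported one).

def is_contextual(p, i):
    """Test condition (2): xi is not primary, but dropping xi from a
    full conditioning event changes the class probability."""
    if is_primary(p, i):
        return False
    n = len(next(iter(p))) - 1  # number of features
    for point in p:             # candidate constants a = (a0, ..., an)
        rest = {j: point[j] for j in range(1, n + 1)}
        without_i = {j: v for j, v in rest.items() if j != i}
        full = conditional(p, {0: point[0]}, rest)
        dropped = conditional(p, {0: point[0]}, without_i)
        if (full is not None and dropped is not None
                and abs(full - dropped) > 1e-12):
            return True
    return False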
The definitions above refer to the class x0. In the
following, we will assume that the class is fixed, so that
we do not need to explicitly mention the class.
Irrelevant Feature: Feature xi (where 1 ≤ i ≤ n) is an
irrelevant feature when xi is neither a primary feature nor
a contextual feature.
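This definition translates directly, given the two sketches above:

def is_irrelevant(p, i):
    """A feature is irrelevant iff it is neither primary nor contextual."""
    return not is_primary(p, i) and not is_contextual(p, i)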
Context-Sensitive Feature: A primary feature xi is
context-sensitive to a contextual feature xj when there are
values a0, ai, and aj, such that:

p(x0 = a0 | xi = ai, xj = aj) ≠ p(x0 = a0 | xi = ai)   (3)
The primary concern here is strategies for handling
context-sensitive features.
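Condition (3) admits the same kind of illustrative test as the earlier ones, again reusing the helpers sketched above:

def is_context_sensitive(p, i, j):
    """Test condition (3): fixing contextual xj shifts the class
    probability beyond what conditioning on primary xi alone gives."""
    if not is_primary(p, i) or not is_contextual(p, j):
        return False
    for a0 in values(p, 0):
        for ai in values(p, i):
            for aj in values(p, j):
                both = conditional(p, {0: a0}, {i: ai, j: aj})
                alone = conditional(p, {0: a0}, {i: ai})
                if (both is not None and alone is not None
                        and abs(both - alone) > 1e-12):
                    return True
    return False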
Table 1 illustrates the above definitions. Since
p(x0 = 1) = 0.5 and p(x0 = 1 | x1 = 1) = 0.44, it follows
that x1 is a primary feature:

p(x0 = 1) ≠ p(x0 = 1 | x1 = 1)   (4)
Since p(x0 = a0 | x2 = a2) equals p(x0 = a0) for all values
a0 and a2, it follows that x2 is not a primary feature.
However, x2 is not an irrelevant feature, since:

p(x0 = 1 | x1 = 1, x2 = 1, x3 = 1) ≠ p(x0 = 1 | x1 = 1, x3 = 1)   (5)
Therefore x2 is a contextual feature. Furthermore, primary
feature x1 is context-sensitive to the contextual feature
x2, since:

p(x0 = 1 | x1 = 1, x2 = 1) = 0.53   (6)

and

p(x0 = 1 | x1 = 1) = 0.44   (7)
Finally, x3 is an irrelevant feature, since, for all values a0,
a1, a2, and a3:

p(x0 = a0 | x1 = a1, x2 = a2, x3 = a3) = p(x0 = a0 | x1 = a1, x2 = a2)   (8)
When p is unknown, it is often possible to use background
knowledge to distinguish primary, contextual, and irrelevant
features.
Table 1: Examples of the different types of features.

class x0 | primary x1 | contextual x2 | irrelevant x3 | probability
0 | 0 | 0 | 0 | 0.03
0 | 0 | 0 | 1 | 0.03
0 | 0 | 1 | 0 | 0.08
0 | 0 | 1 | 1 | 0.08
0 | 1 | 0 | 0 | 0.07
0 | 1 | 0 | 1 | 0.07
0 | 1 | 1 | 0 | 0.07
0 | 1 | 1 | 1 | 0.07
1 | 0 | 0 | 0 | 0.07
1 | 0 | 0 | 1 | 0.07
1 | 0 | 1 | 0 | 0.07
1 | 0 | 1 | 1 | 0.07
1 | 1 | 0 | 0 | 0.03
1 | 1 | 0 | 1 | 0.03
1 | 1 | 1 | 0 | 0.08
1 | 1 | 1 | 1 | 0.08
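To tie the worked example together, Table 1 can be encoded in the dict representation sketched earlier and the claims in the text checked numerically (printed values are approximate under floating-point arithmetic):

table1 = {
    (0, 0, 0, 0): 0.03, (0, 0, 0, 1): 0.03,
    (0, 0, 1, 0): 0.08, (0, 0, 1, 1): 0.08,
    (0, 1, 0, 0): 0.07, (0, 1, 0, 1): 0.07,
    (0, 1, 1, 0): 0.07, (0, 1, 1, 1): 0.07,
    (1, 0, 0, 0): 0.07, (1, 0, 0, 1): 0.07,
    (1, 0, 1, 0): 0.07, (1, 0, 1, 1): 0.07,
    (1, 1, 0, 0): 0.03, (1, 1, 0, 1): 0.03,
    (1, 1, 1, 0): 0.08, (1, 1, 1, 1): 0.08,
}

print(marginal(table1, {0: 1}))                   # ~0.5
print(conditional(table1, {0: 1}, {1: 1}))        # ~0.44, as in (4)
print(conditional(table1, {0: 1}, {1: 1, 2: 1}))  # ~0.53, as in (6)
print(is_primary(table1, 1))                      # True
print(is_contextual(table1, 2))                   # True
print(is_irrelevant(table1, 3))                   # True
print(is_context_sensitive(table1, 1, 2))         # True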