MACHINE LEARNING
cantly better than all of the alternatives that were
examined [7].
SPEECH RECOGNITION
This section examines strategies 1, 2, and 5: contex-
tual normalization, contextual expansion, and contextual
weighting. The problem is to recognize a vowel spoken by
an arbitrary speaker. There are ten continuous primary
features (derived from spectral data) and two discrete con-
textual features (the speaker’s identity and sex). The
observations fall in eleven classes (eleven different
vowels) [8].
For speech recognition, spectral data is a primary
feature for recognizing a vowel. The sex of the speaker is a
contextual feature, since we can achieve better recognition
by exploiting the fact that a man’s voice tends to sound
different from a woman’s voice. Sex is not a primary
feature, since knowing the speaker’s sex, by itself, does
not help us to recognize a vowel. The experimental design
ensures this, since all speakers spoke the same set of
vowels. This background knowledge lets us distinguish
primary and contextual features, without having to
determine the probability distribution.
The data were divided into a training set and a testing
set. Each of the eleven vowels was spoken six times by
each speaker. The training set is from four male and four
female speakers (11 × 6 × 8 = 528 observations). The
testing set is from four new male and three new female
speakers (11 × 6 × 7 = 462 observations). Using a wide
variety of neural network algorithms, Robinson [9]
achieved accuracies ranging from 33% to 56% correct on
the testing set. The mean score was 49%, with a standard
deviation of 6%. Table 3 summarizes Robinson’s results.
Three of the five strategies discussed above were
applied to the data:
Contextual normalization: Each feature was normalized
by equation (11), where the context vector c was simply
the speaker’s identity. The values of μ_i(c) and σ_i(c) were
estimated by taking the average and standard deviation of
x_i for the speaker c. In a practical application, this would
require buffering speech samples from a new speaker until
enough data are collected to calculate the average and
standard deviation.
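The normalization step can be sketched as follows. This is a minimal Python example; the helper name normalize_by_speaker and the array shapes are illustrative, not from the paper:

```python
import numpy as np

def normalize_by_speaker(features, speaker_ids):
    """Contextual normalization: scale each primary feature to zero
    mean and unit variance within each speaker's observations."""
    features = np.asarray(features, dtype=float)
    normalized = np.empty_like(features)
    for speaker in np.unique(speaker_ids):
        mask = speaker_ids == speaker
        mu = features[mask].mean(axis=0)    # mu_i(c) for this speaker
        sigma = features[mask].std(axis=0)  # sigma_i(c) for this speaker
        sigma[sigma == 0] = 1.0             # guard against zero variance
        normalized[mask] = (features[mask] - mu) / sigma
    return normalized

# Two speakers, three observations each, two primary features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
              [4.0, 1.0], [6.0, 2.0], [8.0, 3.0]])
speakers = np.array([0, 0, 0, 1, 1, 1])
Xn = normalize_by_speaker(X, speakers)
```

After normalization, each feature has zero mean and unit variance within each speaker's observations, which removes the speaker-specific shift and scale before classification.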
Contextual expansion: The sex of the speaker was
treated as another feature. This strategy is not applicable to
the speaker’s identity, since the speakers in the testing set
are distinct from the speakers in the training set.
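Contextual expansion amounts to appending the contextual feature to the primary feature vector. A minimal sketch, where the data are randomly generated for illustration:

```python
import numpy as np

# Ten primary spectral features per observation (illustrative random data).
rng = np.random.default_rng(0)
primary = rng.normal(size=(528, 10))

# Sex of the speaker, coded 0 (male) / 1 (female), one value per observation.
sex = np.repeat([0, 1], 264)

# Contextual expansion: treat sex as an eleventh input feature.
expanded = np.column_stack([primary, sex])
```

The classifier is then trained on the expanded eleven-dimensional vectors instead of the ten primary features alone.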
Table 3: Robinson’s (1989) results with the vowel data.

classifier              | no. of hidden units | no. correct | percent correct
Single-layer perceptron |   -                 | 154         | 33
Multi-layer perceptron  |  88                 | 234         | 51
Multi-layer perceptron  |  22                 | 206         | 45
Multi-layer perceptron  |  11                 | 203         | 44
Modified Kanerva Model  | 528                 | 231         | 50
Modified Kanerva Model  |  88                 | 197         | 43
Radial Basis Function   | 528                 | 247         | 53
Radial Basis Function   |  88                 | 220         | 48
Gaussian node network   | 528                 | 252         | 55
Gaussian node network   |  88                 | 247         | 53
Gaussian node network   |  22                 | 250         | 54
Gaussian node network   |  11                 | 211         | 47
Square node network     |  88                 | 253         | 55
Square node network     |  22                 | 236         | 51
Square node network     |  11                 | 217         | 50
Nearest neighbor        |   -                 | 260         | 56

Contextual weighting: Let x be a vector of primary
features and let c be a vector of contextual features. As
with contextual normalization, the context vector c is
simply the speaker’s identity. The features were multiplied
by weights, where the weight w_i for a feature x_i was the
ratio of the inter-class deviation σ_i^inter to the intra-class
deviation σ_i^intra:

    w_i = σ_i^inter / σ_i^intra    (12)
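Given estimates of the two deviations, equation (12) reduces to a per-feature rescaling. A minimal sketch with hypothetical deviation values:

```python
import numpy as np

# Hypothetical deviations for three primary features.
sigma_inter = np.array([2.0, 1.0, 0.5])   # variation across classes
sigma_intra = np.array([0.5, 1.0, 2.0])   # variation within a class

# Equation (12): features that vary a lot across classes but little
# within a class receive large weights.
w = sigma_inter / sigma_intra

x = np.array([1.0, 1.0, 1.0])             # a primary feature vector
weighted = w * x
```

The effect is that a subsequent distance-based classifier pays more attention to features that discriminate between classes and less to features that merely vary with the speaker.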
The inter-class deviation of a feature indicates the
variation in a feature’s value, across class boundaries. It is
the average, for all speakers c in the training set, of the
standard deviation of the feature, across all classes (all
vowels), for a given speaker. Let σ_1, ..., σ_m be the
standard deviations of x_i for each of the m speakers in the
training set. The inter-class deviation of x_i is:

    σ_i^inter = (1/m) Σ_{j=1}^{m} σ_j    (13)
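Equation (13) can be computed directly from the training data. A minimal sketch, where the function name and the tiny one-feature data set are illustrative:

```python
import numpy as np

def inter_class_deviation(features, speaker_ids):
    """Equation (13): average, over the m speakers, of the standard
    deviation of each feature across all of that speaker's observations."""
    features = np.asarray(features, dtype=float)
    per_speaker_std = [features[speaker_ids == s].std(axis=0)
                       for s in np.unique(speaker_ids)]
    return np.mean(per_speaker_std, axis=0)

# One feature, two speakers with two observations each.
X = np.array([[1.0], [3.0], [2.0], [6.0]])
speakers = np.array([0, 0, 1, 1])
sigma_inter = inter_class_deviation(X, speakers)
```

Here the per-speaker standard deviations are std([1, 3]) = 1 and std([2, 6]) = 2, so the inter-class deviation is their average, 1.5.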
The intra-class deviation of a feature indicates the
variation in a feature’s value, within a class boundary. It is
the average, for all speakers in the training set and all
classes, of the standard deviation of the feature, for a given
speaker and a given class. Let {σ_{j,k}}, where 1 ≤ j ≤ m
and 1 ≤ k ≤ n, be the standard deviations of x_i for each of