6 Experiments with Noun-Modifier Relations
This section describes experiments with 600 noun-modifier
pairs, hand-labeled with 30 classes of semantic relations
[Nastase and Szpakowicz, 2003]. We experiment with both
a 30-class problem and a 5-class problem. The 30 classes of
semantic relations include cause (e.g., in “flu virus”, the
head noun “virus” is the cause of the modifier “flu”), loca-
tion (e.g., in “home town”, the head noun “town” is the lo-
cation of the modifier “home”), part (e.g., in “printer tray”,
the head noun “tray” is part of the modifier “printer”), and
topic (e.g., in “weather report”, the head noun “report” is
about the topic “weather”). For a full list of classes, see Nas-
tase and Szpakowicz [2003] or Turney and Littman [2005].
The 30 classes belong to 5 general groups of relations:
causal relations, temporal relations, spatial relations, par-
ticipatory relations (e.g., in “student protest”, the “student”
is the agent who performs the “protest”; agent is a partici-
patory relation), and qualitative relations (e.g., in “oak tree”,
“oak” is a type of “tree”; type is a qualitative relation).
The following experiments use single nearest neighbour
classification with leave-one-out cross-validation. For
leave-one-out cross-validation, the testing set consists of a
single noun-modifier pair and the training set consists of the
599 remaining pairs. The data set is split 600 times, so that
each noun-modifier pair gets a turn as the testing pair. The
predicted class of the testing pair is the class
of the single nearest neighbour in the training set. As the
measure of nearness, we use LRA to calculate the relational
similarity between the testing pair and the training pairs.
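The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the LRA implementation: the toy similarity matrix stands in for the LRA relational similarities between pairs, and the labels stand in for the hand-labeled relation classes.

```python
# Sketch of single nearest neighbour classification with
# leave-one-out cross-validation, as used in the experiments.

def leave_one_out_1nn(sim, labels):
    """Predict each item's class from its single nearest neighbour,
    holding that item out of the training set."""
    predictions = []
    for i in range(len(labels)):
        # The neighbour is the most similar *other* pair.
        neighbour = max(
            (j for j in range(len(labels)) if j != i),
            key=lambda j: sim[i][j],
        )
        predictions.append(labels[neighbour])
    return predictions

# Toy data: 4 noun-modifier pairs, 2 classes; pairs 0 and 1 are
# relationally similar to each other, as are pairs 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels = ["cause", "cause", "topic", "topic"]
print(leave_one_out_1nn(sim, labels))  # -> ['cause', 'cause', 'topic', 'topic']
```

In the actual experiments, the matrix would be 600 x 600 and each entry would be computed by LRA.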
Following Turney and Littman [2005], we evaluate the
performance by accuracy and also by the macroaveraged F
measure [Lewis, 1991]. The F measure is the harmonic
mean of precision and recall. Macroaveraging calculates the
precision, recall, and F for each class separately, and then
calculates the average across all classes.
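The macroaveraged F measure can be sketched as below; the class names and toy predictions are illustrative, not taken from the experiments.

```python
# Macroaveraged F: compute precision, recall, and F for each class
# separately, then average F across classes (each class weighted equally,
# regardless of how many examples it has).

def macro_f(true, pred, classes):
    f_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F is the harmonic mean of precision and recall.
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        f_scores.append(f)
    return sum(f_scores) / len(f_scores)

true = ["cause", "cause", "topic", "topic"]
pred = ["cause", "topic", "topic", "topic"]
# cause: P=1.0, R=0.5, F=2/3; topic: P=2/3, R=1.0, F=0.8; macro F = 11/15.
print(round(macro_f(true, pred, ["cause", "topic"]), 3))  # -> 0.733
```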
There are 600 word pairs in the input set for LRA. In step
2, introducing alternate pairs multiplies the number of pairs
by four, resulting in 2,400 pairs. In step 5, for each pair A:B,
we add B:A, yielding 4,800 pairs. Some pairs are dropped
because they correspond to zero vectors and a few words do
not appear in Lin’s thesaurus. The sparse matrix (step 7) has
4,748 rows and 8,000 columns, with a density of 8.4%.
Table 3 shows the performance of LRA and VSM on the
30-class problem. VSM-AV is VSM with the AltaVista cor-
pus and VSM-WMTS is VSM with the WMTS corpus. The
results for VSM-AV are taken from Turney and Littman
[2005]. All three pairwise differences in the three F meas-
ures are statistically significant at the 95% level, according
to the paired t-test. The accuracy of LRA is significantly
higher than the accuracies of VSM-AV and VSM-WMTS,
according to the Fisher Exact Test, but the difference be-
tween the two VSM accuracies is not significant. Using the
same corpus as the VSM, LRA’s accuracy is 15% higher in
absolute terms and 61% higher in relative terms.
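The reported improvement follows directly from the accuracies in Table 3 (LRA at 39.8% versus VSM-WMTS at 24.7%, both using the WMTS corpus):

```python
# Sanity check of the reported improvement on the 30-class problem.
lra, vsm_wmts = 0.398, 0.247
absolute = lra - vsm_wmts               # about 15% in absolute terms
relative = (lra - vsm_wmts) / vsm_wmts  # about 61% in relative terms
print(round(absolute, 3), round(relative, 2))  # -> 0.151 0.61
```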
Table 4 compares the performance of LRA and VSM on
the 5-class problem. The accuracy and F measure of LRA
are significantly higher than the accuracies and F measures
of VSM-AV and VSM-WMTS, but the differences between
the two VSM accuracies and F measures are not significant.
Using the same corpus as the VSM, LRA’s accuracy is 14%
higher in absolute terms and 32% higher in relative terms.
Table 3. Comparison of LRA and VSM on the 30-class problem.

            VSM-AV   VSM-WMTS     LRA
Correct        167        148     239
Incorrect      433        452     361
Total          600        600     600
Accuracy     27.8%      24.7%   39.8%
Precision    27.9%      24.0%   41.0%
Recall       26.8%      20.9%   35.9%
F            26.5%      20.3%   36.6%
Table 4. Comparison of LRA and VSM on the 5-class problem.

            VSM-AV   VSM-WMTS     LRA
Correct        274        264     348
Incorrect      326        336     252
Total          600        600     600
Accuracy     45.7%      44.0%   58.0%
Precision    43.4%      40.2%   55.9%
Recall       43.1%      41.4%   53.6%
F            43.2%      40.6%   54.6%
7 Discussion
The experimental results in Sections 5 and 6 demonstrate
that LRA performs significantly better than the VSM, but it
is also clear that there is room for improvement. The accu-
racy might not yet be adequate for practical applications,
although past work has shown that it is possible to adjust the
tradeoff of precision versus recall [Turney and Littman,
2005]. For some of the applications, such as information
extraction, LRA might be suitable if it is adjusted for high
precision, at the expense of low recall.
Another limitation is speed; it took almost nine days for
LRA to answer 374 analogy questions. However, with pro-
gress in computer hardware, speed will gradually become
less of a concern. Also, the software has not been optimized
for speed; there are several places where the efficiency
could be increased and many operations are parallelizable. It
may also be possible to precompute much of the information
for LRA, although this would require substantial changes to
the algorithm.
The difference in performance between VSM-AV and
VSM-WMTS shows that VSM is sensitive to the size of the
corpus. Although LRA is able to surpass VSM-AV when
the WMTS corpus is only about one tenth the size of the AV
corpus, it seems likely that LRA would perform better with
a larger corpus. The WMTS corpus requires one terabyte of
hard disk space, but progress in hardware will likely make
ten or even one hundred terabytes affordable in the rela-
tively near future.
For noun-modifier classification, more labeled data
should yield performance improvements. With 600 noun-
modifier pairs and 30 classes, the average class has only 20
examples. We expect that the accuracy would improve sub-