Measuring Semantic Similarity by Latent Relational Analysis



6 Experiments with Noun-Modifier Relations

This section describes experiments with 600 noun-modifier
pairs, hand-labeled with 30 classes of semantic relations
[Nastase and Szpakowicz, 2003]. We experiment with both
a 30-class problem and a 5-class problem. The 30 classes of
semantic relations include cause (e.g., in “flu virus”, the head
noun “virus” is the cause of the modifier “flu”), location
(e.g., in “home town”, the head noun “town” is the location
of the modifier “home”), part (e.g., in “printer tray”, the head
noun “tray” is part of the modifier “printer”), and topic (e.g.,
in “weather report”, the head noun “report” is about the topic
“weather”). For a full list of classes, see Nastase and
Szpakowicz [2003] or Turney and Littman [2005]. The 30
classes belong to 5 general groups of relations: causal
relations, temporal relations, spatial relations, participatory
relations (e.g., in “student protest”, the “student” is the agent
who performs the “protest”; agent is a participatory relation),
and qualitative relations (e.g., in “oak tree”, “oak” is a type
of “tree”; type is a qualitative relation).

The following experiments use single nearest neighbour
classification with leave-one-out cross-validation. For
leave-one-out cross-validation, the testing set consists of a
single noun-modifier pair and the training set consists of the
599 remaining noun-modifier pairs. The data set is split 600
times, so that each pair gets a turn as the testing pair. The
predicted class of the testing pair is the class of its single
nearest neighbour in the training set. As the measure of
nearness, we use LRA to calculate the relational similarity
between the testing pair and the training pairs.
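
To make the protocol concrete, the following Python sketch (not part
of the original system) runs leave-one-out single nearest neighbour
classification. The function relational_similarity is a hypothetical
stand-in for LRA's pairwise relational similarity.

    def leave_one_out_1nn(pairs, relational_similarity):
        # pairs: list of ((head, modifier), class_label) tuples.
        # relational_similarity: hypothetical stand-in for LRA's
        # relational similarity between two word pairs.
        predictions = []
        for i, (test_pair, _) in enumerate(pairs):
            train = pairs[:i] + pairs[i + 1:]  # the 599 remaining pairs
            # The predicted class is the class of the single nearest
            # neighbour in the training set.
            _, nearest_label = max(
                train,
                key=lambda item: relational_similarity(test_pair, item[0]),
            )
            predictions.append(nearest_label)
        return predictions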

Following Turney and Littman [2005], we evaluate the
performance by accuracy and also by the macroaveraged F
measure [Lewis, 1991]. The F measure is the harmonic
mean of precision and recall. Macroaveraging calculates the
precision, recall, and F for each class separately, and then
calculates the average across all classes.
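
A minimal sketch of this metric, consistent with the description
above (per-class precision, recall, and F, then an unweighted average
over classes):

    def macroaveraged_f(gold, predicted):
        f_scores = []
        for c in set(gold):
            tp = sum(g == c and p == c for g, p in zip(gold, predicted))
            fp = sum(g != c and p == c for g, p in zip(gold, predicted))
            fn = sum(g == c and p != c for g, p in zip(gold, predicted))
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            # F is the harmonic mean of precision and recall.
            f = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
            f_scores.append(f)
        # Macroaveraging: unweighted mean of the per-class F scores.
        return sum(f_scores) / len(f_scores)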

There are 600 word pairs in the input set for LRA. In step
2, introducing alternate pairs multiplies the number of pairs
by four, resulting in 2,400 pairs. In step 5, for each pair A:B,
we add B:A, yielding 4,800 pairs. Some pairs are dropped
because they correspond to zero vectors and a few words do
not appear in Lin’s thesaurus. The sparse matrix (step 7) has
4,748 rows and 8,000 columns, with a density of 8.4%.
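
The pair bookkeeping in steps 2 and 5 can be sketched as follows.
Generating the alternate pairs themselves requires Lin's thesaurus,
so the hypothetical alternates_for function only stands in for that
step here.

    def expand_pairs(input_pairs, alternates_for):
        # Step 2: each original pair A:B contributes three alternate
        # pairs (supplied by the hypothetical alternates_for), so the
        # 600 input pairs become 2,400.
        expanded = []
        for a, b in input_pairs:
            expanded.append((a, b))
            expanded.extend(alternates_for(a, b))  # three alternates per pair
        # Step 5: for each pair A:B, also add the reversed pair B:A,
        # doubling 2,400 pairs to 4,800.
        return expanded + [(b, a) for a, b in expanded]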

Table 3 shows the performance of LRA and VSM on the
30-class problem. VSM-AV is VSM with the AltaVista cor-
pus and VSM-WMTS is VSM with the WMTS corpus. The
results for VSM-AV are taken from Turney and Littman
[2005]. All three pairwise differences in the three F meas-
ures are statistically significant at the 95% level, according
to the Paired T-Test. The accuracy of LRA is significantly
higher than the accuracies of VSM-AV and VSM-WMTS,
according to the Fisher Exact Test, but the difference be-
tween the two VSM accuracies is not significant. Using the
same corpus as the VSM, LRA’s accuracy is 15% higher in
absolute terms and 61% higher in relative terms.

Table 4 compares the performance of LRA and VSM on
the 5-class problem. The accuracy and F measure of LRA
are significantly higher than the accuracies and F measures
of VSM-AV and VSM-WMTS, but the differences between
the two VSM accuracies and F measures are not significant.
Using the same corpus as the VSM, LRA’s accuracy is 14%
higher in absolute terms and 32% higher in relative terms.
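
The paper does not give the exact configuration of the significance
tests, but the accuracy comparisons can be reproduced with a standard
Fisher Exact Test on the correct/incorrect counts from Table 3 below.
This sketch uses SciPy as an assumed dependency.

    from scipy.stats import fisher_exact

    # Correct/incorrect counts from Table 3 (30-class problem).
    lra = (239, 361)
    vsm_av = (167, 433)
    vsm_wmts = (148, 452)

    # 2x2 contingency tables: LRA against each VSM variant.
    for name, counts in [("LRA vs VSM-AV", vsm_av),
                         ("LRA vs VSM-WMTS", vsm_wmts)]:
        _, p = fisher_exact([lra, counts])
        print(name, "p =", p)  # expected: significant (p < 0.05)

    _, p = fisher_exact([vsm_av, vsm_wmts])
    print("VSM-AV vs VSM-WMTS p =", p)  # expected: not significant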

Table 3. Comparison of LRA and VSM on the 30-class problem.

              VSM-AV   VSM-WMTS      LRA
  Correct        167        148      239
  Incorrect      433        452      361
  Total          600        600      600
  Accuracy     27.8%      24.7%    39.8%
  Precision    27.9%      24.0%    41.0%
  Recall       26.8%      20.9%    35.9%
  F            26.5%      20.3%    36.6%

Table 4. Comparison of LRA and VSM on the 5-class problem.

              VSM-AV   VSM-WMTS      LRA
  Correct        274        264      348
  Incorrect      326        336      252
  Total          600        600      600
  Accuracy     45.7%      44.0%    58.0%
  Precision    43.4%      40.2%    55.9%
  Recall       43.1%      41.4%    53.6%
  F            43.2%      40.6%    54.6%

7 Discussion

The experimental results in Sections 5 and 6 demonstrate
that LRA performs significantly better than the VSM, but it
is also clear that there is room for improvement. The accu-
racy might not yet be adequate for practical applications,
although past work has shown that it is possible to adjust the
tradeoff of precision versus recall [Turney and Littman,
2005]. For some of the applications, such as information
extraction, LRA might be suitable if it is adjusted for high
precision, at the expense of low recall.

Another limitation is speed; it took almost nine days for
LRA to answer 374 analogy questions. However, with pro-
gress in computer hardware, speed will gradually become
less of a concern. Also, the software has not been optimized
for speed; there are several places where the efficiency
could be increased and many operations are parallelizable. It
may also be possible to precompute much of the information
for LRA, although this would require substantial changes to
the algorithm.

The difference in performance between VSM-AV and
VSM-WMTS shows that VSM is sensitive to the size of the
corpus. Although LRA is able to surpass VSM-AV when
the WMTS corpus is only about one tenth the size of the AV
corpus, it seems likely that LRA would perform better with
a larger corpus. The WMTS corpus requires one terabyte of
hard disk space, but progress in hardware will likely make
ten or even one hundred terabytes affordable in the rela-
tively near future.

For noun-modifier classification, more labeled data
should yield performance improvements. With 600 noun-
modifier pairs and 30 classes, the average class has only 20
examples. We expect that the accuracy would improve
substantially with more labeled examples per class.


