6 Experiments with Noun-Modifier Relations
This section describes experiments with 600 noun-modifier
pairs, hand-labeled with 30 classes of semantic relations
[Nastase and Szpakowicz, 2003]. We experiment with both
a 30-class problem and a 5-class problem. The 30 classes of
semantic relations include cause (e.g., in “flu virus”, the
head noun “virus” is the cause of the modifier “flu”), loca-
tion (e.g., in “home town”, the head noun “town” is the lo-
cation of the modifier “home”), part (e.g., in “printer tray”,
the head noun “tray” is part of the modifier “printer”), and
topic (e.g., in “weather report”, the head noun “report” is
about the topic “weather”). For a full list of classes, see Nas-
tase and Szpakowicz [2003] or Turney and Littman [2005].
The 30 classes belong to 5 general groups of relations:
causal relations, temporal relations, spatial relations, par-
ticipatory relations (e.g., in “student protest”, the “student”
is the agent who performs the “protest”; agent is a partici-
patory relation), and qualitative relations (e.g., in “oak tree”,
“oak” is a type of “tree”; type is a qualitative relation).
The following experiments use single nearest neighbour
classification with leave-one-out cross-validation. For
leave-one-out cross-validation, the testing set consists of a
single noun-modifier pair and the training set consists of the
599 remaining pairs. The data set is split 600 times, so that
each noun-modifier pair gets a turn as the testing pair. The
predicted class of the testing pair is the class
of the single nearest neighbour in the training set. As the
measure of nearness, we use LRA to calculate the relational
similarity between the testing pair and the training pairs.
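The evaluation protocol above can be sketched as follows. This is a minimal illustration, not the LRA implementation: the toy similarity matrix stands in for the LRA relational similarities between pairs, and the labels stand in for the hand-labeled relation classes.

```python
# Sketch of single nearest neighbour classification with
# leave-one-out cross-validation, as used in the experiments.

def leave_one_out_1nn(sim, labels):
    """Predict each item's class from its single nearest neighbour,
    holding that item out of the training set."""
    predictions = []
    for i in range(len(labels)):
        # The neighbour is the most similar *other* pair.
        neighbour = max(
            (j for j in range(len(labels)) if j != i),
            key=lambda j: sim[i][j],
        )
        predictions.append(labels[neighbour])
    return predictions

# Toy data: 4 noun-modifier pairs, 2 classes; pairs 0 and 1 are
# relationally similar to each other, as are pairs 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels = ["cause", "cause", "topic", "topic"]
print(leave_one_out_1nn(sim, labels))  # -> ['cause', 'cause', 'topic', 'topic']
```

In the actual experiments, the matrix would be 600 x 600 and each entry would be computed by LRA.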
Following Turney and Littman [2005], we evaluate the
performance by accuracy and also by the macroaveraged F
measure [Lewis, 1991]. The F measure is the harmonic
mean of precision and recall. Macroaveraging calculates the
precision, recall, and F for each class separately, and then
calculates the average across all classes.
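The macroaveraged F measure can be sketched as below; the class names and toy predictions are illustrative, not taken from the experiments.

```python
# Macroaveraged F: compute precision, recall, and F for each class
# separately, then average F across classes (each class weighted equally,
# regardless of how many examples it has).

def macro_f(true, pred, classes):
    f_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F is the harmonic mean of precision and recall.
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        f_scores.append(f)
    return sum(f_scores) / len(f_scores)

true = ["cause", "cause", "topic", "topic"]
pred = ["cause", "topic", "topic", "topic"]
# cause: P=1.0, R=0.5, F=2/3; topic: P=2/3, R=1.0, F=0.8; macro F = 11/15.
print(round(macro_f(true, pred, ["cause", "topic"]), 3))  # -> 0.733
```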
There are 600 word pairs in the input set for LRA. In step
2, introducing alternate pairs multiplies the number of pairs
by four, resulting in 2,400 pairs. In step 5, for each pair A:B,
we add B:A, yielding 4,800 pairs. Some pairs are dropped
because they correspond to zero vectors and a few words do
not appear in Lin’s thesaurus. The sparse matrix (step 7) has
4,748 rows and 8,000 columns, with a density of 8.4%.
Table 3 shows the performance of LRA and VSM on the
30-class problem. VSM-AV is VSM with the AltaVista cor-
pus and VSM-WMTS is VSM with the WMTS corpus. The
results for VSM-AV are taken from Turney and Littman
[2005]. All three pairwise differences in the three F meas-
ures are statistically significant at the 95% level, according
to the paired t-test. The accuracy of LRA is significantly
higher than the accuracies of VSM-AV and VSM-WMTS,
according to the Fisher Exact Test, but the difference be-
tween the two VSM accuracies is not significant. Using the
same corpus as the VSM, LRA’s accuracy is 15% higher in
absolute terms and 61% higher in relative terms.
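The reported improvement follows directly from the accuracies in Table 3 (LRA at 39.8% versus VSM-WMTS at 24.7%, both using the WMTS corpus):

```python
# Sanity check of the reported improvement on the 30-class problem.
lra, vsm_wmts = 0.398, 0.247
absolute = lra - vsm_wmts               # about 15% in absolute terms
relative = (lra - vsm_wmts) / vsm_wmts  # about 61% in relative terms
print(round(absolute, 3), round(relative, 2))  # -> 0.151 0.61
```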
Table 4 compares the performance of LRA and VSM on
the 5-class problem. The accuracy and F measure of LRA
are significantly higher than the accuracies and F measures
of VSM-AV and VSM-WMTS, but the differences between
the two VSM accuracies and F measures are not significant.
Using the same corpus as the VSM, LRA’s accuracy is 14%
higher in absolute terms and 32% higher in relative terms.
Table 3. Comparison of LRA and VSM on the 30-class problem.

            VSM-AV   VSM-WMTS     LRA
Correct        167        148     239
Incorrect      433        452     361
Total          600        600     600
Accuracy     27.8%      24.7%   39.8%
Precision    27.9%      24.0%   41.0%
Recall       26.8%      20.9%   35.9%
F            26.5%      20.3%   36.6%
Table 4. Comparison of LRA and VSM on the 5-class problem.

            VSM-AV   VSM-WMTS     LRA
Correct        274        264     348
Incorrect      326        336     252
Total          600        600     600
Accuracy     45.7%      44.0%   58.0%
Precision    43.4%      40.2%   55.9%
Recall       43.1%      41.4%   53.6%
F            43.2%      40.6%   54.6%
7 Discussion
The experimental results in Sections 5 and 6 demonstrate
that LRA performs significantly better than the VSM, but it
is also clear that there is room for improvement. The accu-
racy might not yet be adequate for practical applications,
although past work has shown that it is possible to adjust the
tradeoff of precision versus recall [Turney and Littman,
2005]. For some of the applications, such as information
extraction, LRA might be suitable if it is adjusted for high
precision, at the expense of low recall.
Another limitation is speed; it took almost nine days for
LRA to answer 374 analogy questions. However, with pro-
gress in computer hardware, speed will gradually become
less of a concern. Also, the software has not been optimized
for speed; there are several places where the efficiency
could be increased and many operations are parallelizable. It
may also be possible to precompute much of the information
for LRA, although this would require substantial changes to
the algorithm.
The difference in performance between VSM-AV and
VSM-WMTS shows that VSM is sensitive to the size of the
corpus. Although LRA is able to surpass VSM-AV when
the WMTS corpus is only about one tenth the size of the AV
corpus, it seems likely that LRA would perform better with
a larger corpus. The WMTS corpus requires one terabyte of
hard disk space, but progress in hardware will likely make
ten or even one hundred terabytes affordable in the rela-
tively near future.
For noun-modifier classification, more labeled data
should yield performance improvements. With 600 noun-
modifier pairs and 30 classes, the average class has only 20
examples. We expect that the accuracy would improve sub-