Measuring Semantic Similarity by Latent Relational Analysis



was gathered by a web crawler from US academic web sites [Clarke et al., 1998]. The WMTS is a distributed (multiprocessor) search engine, designed primarily for passage retrieval (although document retrieval is possible, as a special case of passage retrieval). Our local copy runs on a 16-CPU Beowulf cluster.

The WMTS is well suited to LRA, because it scales well to large corpora (one terabyte, in our case), it gives exact frequency counts (unlike most web search engines), it is designed for passage retrieval (rather than document retrieval), and it has a powerful query syntax.

As a source of synonyms, we use Lin's [1998] automatically generated thesaurus. Lin's thesaurus was generated by parsing a corpus of about 5 × 10^7 English words, consisting of text from the Wall Street Journal, San Jose Mercury, and AP Newswire [Lin, 1998]. The parser was used to extract pairs of words and their grammatical relations. Words were then clustered into synonym sets, based on the similarity of their grammatical relations. Two words were judged to be highly similar when they tended to have the same kinds of grammatical relations with the same sets of words.

Given a word and its part of speech, Lin's thesaurus provides a list of words, sorted in order of decreasing attributional similarity. This sorting is convenient for LRA, since it makes it possible to focus on words with higher attributional similarity and ignore the rest.

We use Rohde's SVDLIBC implementation of the Singular Value Decomposition, which is based on SVDPACKC [Berry, 1992].
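SVDLIBC computes a truncated SVD of the sparse pair-by-pattern matrix. As an illustrative sketch only (not SVDLIBC itself), the same rank-k smoothing can be written with numpy; the small dense matrix X and the rank k here are hypothetical stand-ins for the real data:

```python
import numpy as np

# Hypothetical stand-in for the pair-by-pattern matrix (rows: word pairs,
# columns: patterns); the real sparse matrix is 17,232 x 8,000.
X = np.random.default_rng(0).random((6, 4))

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate to rank k; the rank-k product is the best least-squares
# approximation of X of that rank, which smooths the sparse counts.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(X_k.shape)  # (6, 4)
```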

5 Experiments with Word Analogy Questions

Table 1 shows one of the 374 SAT analogy questions, along with the relational similarities between the stem and each choice, as calculated by LRA. The choice with the highest relational similarity is also the correct answer for this question (quart is to volume as mile is to distance).

Table 1. Relational similarity measures for a sample SAT question.

    Stem:        quart:volume     Relational similarity
    Choices: (a) day:night        0.373725
             (b) mile:distance    0.677258
             (c) decade:century   0.388504
             (d) friction:heat    0.427860
             (e) part:whole       0.370172
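LRA's answer is simply the choice whose relational similarity to the stem is highest; a minimal sketch using the similarities from Table 1:

```python
# Relational similarities from Table 1 (stem: quart:volume).
similarities = {
    "day:night": 0.373725,
    "mile:distance": 0.677258,
    "decade:century": 0.388504,
    "friction:heat": 0.427860,
    "part:whole": 0.370172,
}

# The selected answer is the choice with the highest relational similarity.
answer = max(similarities, key=similarities.get)
print(answer)  # mile:distance
```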

LRA correctly answered 210 of the 374 analogy questions and incorrectly answered 160 questions. Four questions were skipped, because the stem pair and its alternates did not appear together in any phrases in the corpus, so all choices had a relational similarity of zero. Since there are five choices for each question, we would expect to answer 20% of the questions correctly by random guessing. Therefore we score the performance by giving one point for each correct answer and 0.2 points for each skipped question. LRA attained a score of 56.4% on the 374 SAT questions.
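The scoring rule above (one point per correct answer, 0.2 points per skipped question, divided by the number of questions) can be sketched as:

```python
def sat_score(correct: int, skipped: int, total: int = 374) -> float:
    """One point per correct answer, 0.2 points per skipped question."""
    return (correct + 0.2 * skipped) / total

# LRA: 210 correct, 4 skipped, out of 374 questions.
print(round(100 * sat_score(210, 4), 1))  # 56.4
```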

The average performance of college-bound senior high school students on verbal SAT questions corresponds to a score of about 57% [Turney and Littman, 2005]. The difference between the average human score and the score of LRA is not statistically significant.

With 374 questions and 6 word pairs per question (one stem and five choices), there are 2,244 pairs in the input set. In step 2, introducing alternate pairs multiplies the number of pairs by four, resulting in 8,976 pairs. In step 5, for each pair A:B, we add B:A, yielding 17,952 pairs. However, some pairs are dropped because they correspond to zero vectors (they do not appear together in a window of five words in the WMTS corpus). Also, a few words do not appear in Lin's thesaurus, and some word pairs appear twice in the SAT questions (e.g., lion:cat). The sparse matrix (step 7) has 17,232 rows (word pairs) and 8,000 columns (patterns), with a density of 5.8% (percentage of nonzero values).
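The pair counts above follow directly from the construction, and the reported density implies the number of nonzero matrix entries; a quick arithmetic check:

```python
questions = 374
pairs = questions * 6                 # one stem pair + five choice pairs
print(pairs)                          # 2244 pairs in the input set

with_alternates = pairs * 4           # step 2: alternates quadruple the pairs
print(with_alternates)                # 8976

with_reversals = with_alternates * 2  # step 5: add B:A for each A:B
print(with_reversals)                 # 17952

# After dropping zero vectors and duplicates, 17,232 rows remain; the
# reported 5.8% density implies the count of nonzero entries.
rows, cols = 17_232, 8_000
nonzeros = 0.058 * rows * cols
print(round(nonzeros / 1e6, 1))       # roughly 8.0 million nonzero entries
```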

Table 2 compares LRA to VSM with the 374 analogy questions. VSM-AV refers to the VSM using AltaVista's database as a corpus. The VSM-AV results are taken from Turney and Littman [2005]. We estimate the AltaVista search index contained about 5 × 10^11 English words at the time the VSM-AV experiments took place. Turney and Littman [2005] gave an estimate of 1 × 10^11 English words, but we believe this estimate was slightly conservative. VSM-WMTS refers to the VSM using the WMTS, which contains about 5 × 10^10 English words. We generated the VSM-WMTS results by adapting the VSM to the WMTS.

Table 2. LRA versus VSM with 374 SAT analogy questions.

                 VSM-AV   VSM-WMTS      LRA
    Correct         176        144      210
    Incorrect       193        196      160
    Skipped           5         34        4
    Total           374        374      374
    Score         47.3%      40.3%    56.4%

All three pairwise differences among the three scores in Table 2 are statistically significant with 95% confidence, using the Fisher Exact Test. Using the same corpus as the VSM, LRA achieves a score of 56% whereas the VSM achieves a score of 40%, an absolute difference of 16% and a relative improvement of 40%. When the VSM has a corpus ten times larger than LRA's corpus, LRA is still ahead, with an absolute difference of 9% and a relative improvement of 19%.

Comparing VSM-AV to VSM-WMTS, the smaller corpus has reduced the score of the VSM, but much of the drop is due to the larger number of questions that were skipped (34 for VSM-WMTS versus 5 for VSM-AV). With the smaller corpus, many more of the input word pairs simply do not appear together in short phrases in the corpus. LRA is able to answer as many questions as VSM-AV, although it uses the same corpus as VSM-WMTS, because Lin's [1998] thesaurus allows LRA to substitute synonyms for words that are not in the corpus.

VSM-AV required 17 days to process the 374 analogy questions [Turney and Littman, 2005], compared to 9 days for LRA. As a courtesy to AltaVista, Turney and Littman [2005] inserted a five-second delay between queries. Since the WMTS runs locally, there is no need for such delays; VSM-WMTS processed the questions in one day.


