was gathered by a web crawler from US academic web sites
[Clarke et al., 1998]. The WMTS is a distributed (multi-
processor) search engine, designed primarily for passage
retrieval (although document retrieval is possible, as a spe-
cial case of passage retrieval). Our local copy runs on a
16-CPU Beowulf Cluster.
The WMTS is well suited to LRA, because it scales well
to large corpora (one terabyte, in our case), it gives exact
frequency counts (unlike most web search engines), it is
designed for passage retrieval (rather than document re-
trieval), and it has a powerful query syntax.
As a source of synonyms, we use Lin’s [1998] automati-
cally generated thesaurus. Lin’s thesaurus was generated by
parsing a corpus of about 5 × 10^7 English words, consisting
of text from the Wall Street Journal, San Jose Mercury, and
AP Newswire [Lin, 1998]. The parser was used to extract
pairs of words and their grammatical relations. Words were
then clustered into synonym sets, based on the similarity of
their grammatical relations. Two words were judged to be
highly similar when they tended to have the same kinds of
grammatical relations with the same sets of words.
Given a word and its part of speech, Lin’s thesaurus pro-
vides a list of words, sorted in order of decreasing attribu-
tional similarity. This sorting is convenient for LRA, since it
makes it possible to focus on words with higher attributional
similarity and ignore the rest.
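To make the use of the thesaurus concrete, the following sketch shows how a sorted thesaurus entry can be truncated to the most similar words; the data structure and the sample entry are hypothetical stand-ins for Lin's actual file format, not part of the original system.

    # Hypothetical sketch: assume Lin's thesaurus has been parsed into a dict
    # mapping (word, part_of_speech) to a list of (neighbour, similarity)
    # pairs, already sorted by decreasing attributional similarity.
    def top_synonyms(thesaurus, word, pos, num_sims=10):
        """Keep only the num_sims most similar words and ignore the rest."""
        return [w for w, _ in thesaurus.get((word, pos), [])[:num_sims]]

    # Made-up entry, for illustration only.
    thesaurus = {("quart", "N"): [("pint", 0.21), ("gallon", 0.19), ("litre", 0.18)]}
    print(top_synonyms(thesaurus, "quart", "N", num_sims=2))  # ['pint', 'gallon']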
We use Rohde’s SVDLIBC implementation of the Singu-
lar Value Decomposition, which is based on SVDPACKC
[Berry, 1992].
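For illustration only, the same kind of truncated SVD can be expressed in Python with SciPy; the actual experiments use SVDLIBC, and the matrix size, density, and number of retained singular values (k) below are placeholders rather than the settings used in the paper.

    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Toy sparse pair-by-pattern matrix, standing in for the real one.
    X = sparse_random(1000, 500, density=0.06, format="csr", random_state=0)
    U, s, Vt = svds(X, k=50)   # truncated SVD: keep the k largest singular values
    X_k = U * s                # project each row (word pair) into k dimensions
    print(X_k.shape)           # (1000, 50)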
5 Experiments with Word Analogy Questions
Table 1 shows one of the 374 SAT analogy questions, along
with the relational similarities between the stem and each
choice, as calculated by LRA. The choice with the highest
relational similarity is also the correct answer for this ques-
tion (quart is to volume as mile is to distance).
Table 1. Relational similarity measures for a sample SAT question.

  Stem:        quart:volume      Relational similarity
  Choices: (a) day:night         0.373725
           (b) mile:distance     0.677258
           (c) decade:century    0.388504
           (d) friction:heat     0.427860
           (e) part:whole        0.370172
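As the table illustrates, the answer is simply the choice whose pair has the highest relational similarity to the stem, which amounts to an argmax over the five values (taken here directly from Table 1):

    # Selecting the answer for the sample question in Table 1: pick the choice
    # with the highest relational similarity to the stem pair quart:volume.
    similarities = {
        "day:night":      0.373725,
        "mile:distance":  0.677258,
        "decade:century": 0.388504,
        "friction:heat":  0.427860,
        "part:whole":     0.370172,
    }
    answer = max(similarities, key=similarities.get)
    print(answer)  # mile:distance (the correct answer)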
LRA correctly answered 210 of the 374 analogy ques-
tions and incorrectly answered 160 questions. Four ques-
tions were skipped, because the stem pair and its alternates
did not appear together in any phrases in the corpus, so all
choices had a relational similarity of zero. Since there are
five choices for each question, we would expect to answer
20% of the questions correctly by random guessing. There-
fore we score the performance by giving one point for each
correct answer and 0.2 points for each skipped question.
LRA attained a score of 56.4% on the 374 SAT questions.
The average performance of college-bound senior high
school students on verbal SAT questions corresponds to a
score of about 57% [Turney and Littman, 2005]. The differ-
ence between the average human score and the score of
LRA is not statistically significant.
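The scoring rule is simple enough to state as a one-line computation; the sketch below just reproduces the arithmetic behind the 56.4% figure.

    # One point per correct answer and 0.2 points (the expected value of
    # random guessing among five choices) per skipped question.
    def sat_score(correct, skipped, total=374):
        return (correct + 0.2 * skipped) / total

    print(f"{sat_score(210, 4):.1%}")  # 56.4% for LRA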
With 374 questions and 6 word pairs per question (one
stem and five choices), there are 2,244 pairs in the input set.
In step 2, introducing alternate pairs multiplies the number
of pairs by four, resulting in 8,976 pairs. In step 5, for each
pair A:B, we add B:A, yielding 17,952 pairs. However, some
pairs are dropped because they correspond to zero vectors
(they do not appear together in a window of five words in
the WMTS corpus). Also, a few words do not appear in
Lin’s thesaurus, and some word pairs appear twice in the
SAT questions (e.g., lion:cat). The sparse matrix (step 7) has
17,232 rows (word pairs) and 8,000 columns (patterns),
with a density of 5.8% (percentage of nonzero values).
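The bookkeeping in this paragraph can be verified with a few lines of arithmetic; all of the figures below are taken from the text.

    # Pair counts and matrix density quoted above.
    questions, pairs_per_question = 374, 6
    input_pairs = questions * pairs_per_question  # 2,244 pairs in the input set
    with_alternates = 4 * input_pairs             # 8,976 pairs after step 2
    with_reversals = 2 * with_alternates          # 17,952 pairs after step 5

    rows, cols, density = 17232, 8000, 0.058      # sparse matrix of step 7
    nonzeros = rows * cols * density              # roughly 8 million nonzero cells
    print(input_pairs, with_alternates, with_reversals, round(nonzeros))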
Table 2 compares LRA to VSM with the 374 analogy
questions. VSM-AV refers to the VSM using AltaVista’s
database as a corpus. The VSM-AV results are taken from
Turney and Littman [2005]. We estimate the AltaVista
search index contained about 5 × 10^11 English words at the
time the VSM-AV experiments took place. Turney and
Littman [2005] gave an estimate of 1 × 10^11 English words,
but we believe this estimate was slightly conservative.
VSM-WMTS refers to the VSM using the WMTS, which
contains about 5 × 10^10 English words. We generated the
VSM-WMTS results by adapting the VSM to the WMTS.
Table 2. LRA versus VSM with 374 SAT analogy questions.

               VSM-AV    VSM-WMTS      LRA
  Correct         176         144      210
  Incorrect       193         196      160
  Skipped           5          34        4
  Total           374         374      374
  Score         47.3%       40.3%    56.4%
All three pairwise differences among the scores in Table 2
are statistically significant with 95% confidence, according
to the Fisher Exact Test. Using the same corpus as the VSM, LRA
achieves a score of 56% whereas the VSM achieves a score
of 40%, an absolute difference of 16% and a relative im-
provement of 40%. When VSM has a corpus ten times lar-
ger than LRA’s corpus, LRA is still ahead, with an absolute
difference of 9% and a relative improvement of 19%.
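As an illustration of the significance test, a 2 x 2 Fisher exact test can be run with SciPy; the exact contingency table used in the paper is not spelled out here, so the correct-versus-not-correct split below is an assumption.

    # Assumed 2x2 split: correct vs. not correct (incorrect plus skipped).
    from scipy.stats import fisher_exact

    lra      = (210, 160 + 4)   # LRA: correct, not correct
    vsm_wmts = (144, 196 + 34)  # VSM-WMTS: correct, not correct
    odds_ratio, p_value = fisher_exact([lra, vsm_wmts])
    print(p_value < 0.05)       # True: significant at the 95% confidence level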
Comparing VSM-AV to VSM-WMTS, the smaller cor-
pus has reduced the score of the VSM, but much of the drop
is due to the larger number of questions that were skipped
(34 for VSM-WMTS versus 5 for VSM-AV). With the
smaller corpus, many more of the input word pairs simply
do not appear together in short phrases in the corpus. LRA
is able to answer as many questions as VSM-AV, although
it uses the same corpus as VSM-WMTS, because Lin’s
[1998] thesaurus allows LRA to substitute synonyms for
words that are not in the corpus.
VSM-AV required 17 days to process the 374 analogy
questions [Turney and Littman, 2005], compared to 9 days
for LRA. As a courtesy to AltaVista, Turney and Littman
[2005] inserted a five-second delay between consecutive queries.
Since the WMTS is running locally, there is no need for
delays. VSM-WMTS processed the questions in one day.