plementation of SVD. LRA does not use labeled data, struc-
tured data, or supervised learning.
LRA proceeds as follows:
1. Find alternates: For each word pair A:B in the input set,
look in the thesaurus for the top num_sim words (in the fol-
lowing experiments, num_sim = 10) that are most similar to
A. For each A' that is similar to A, make a new word pair
A':B. Likewise, look for the top num_sim words that are
most similar to B, and for each B', make a new word pair
A:B'. A:B is called the original pair and each A':B or
A:B' is an alternate pair. The intent is for alternates to have
almost the same semantic relations as the original.
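A minimal sketch of step 1 (illustrative only, not the original implementation), assuming a hypothetical thesaurus lookup most_similar(word, n) that returns the n words most similar to word:

```python
# Step 1 sketch. `most_similar(word, n)` is a hypothetical thesaurus lookup
# returning the n words most similar to `word`; it stands in for the
# similarity-ordered thesaurus described above.
NUM_SIM = 10

def find_alternates(a, b, most_similar, num_sim=NUM_SIM):
    """Return the 2 * num_sim alternate pairs A':B and A:B' for the original A:B."""
    alternates = [(a_prime, b) for a_prime in most_similar(a, num_sim)]
    alternates += [(a, b_prime) for b_prime in most_similar(b, num_sim)]
    return alternates
```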
2. Filter alternates: For each original pair A:B, filter the
2 × num_sim alternates as follows. For each alternate pair,
send a query to the search engine, to find the frequency of
phrases that begin with one member of the pair and end with
the other. The phrases cannot have more than max_phrase
words (we use max_phrase = 5). Sort the alternate pairs by
the frequency of their phrases. Select the top num_filter most
frequent alternates and discard the remainder (we use
num_filter = 3, so 17 alternates are dropped). This step tends
to eliminate alternates that have no clear semantic relation.
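A sketch of the filtering, assuming a hypothetical corpus query phrase_frequency(a, b, max_phrase) that returns the number of phrases of at most max_phrase words beginning with one member of the pair and ending with the other:

```python
# Step 2 sketch. `phrase_frequency(a, b, max_phrase)` is a hypothetical
# corpus query returning the number of phrases of at most max_phrase words
# that begin with one member of the pair and end with the other.
NUM_FILTER = 3
MAX_PHRASE = 5

def filter_alternates(alternates, phrase_frequency,
                      num_filter=NUM_FILTER, max_phrase=MAX_PHRASE):
    """Keep the num_filter alternate pairs whose phrases are most frequent."""
    scored = [(phrase_frequency(a, b, max_phrase), (a, b)) for a, b in alternates]
    scored.sort(reverse=True)
    return [pair for _, pair in scored[:num_filter]]
```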
3. Find phrases: For each pair (originals and alternates),
make a list of phrases in the corpus that contain the pair.
Query the search engine for all phrases that begin with one
member of the pair and end with the other, with a minimum
of min_inter intervening words and a maximum of max_inter
intervening words (we use min_inter = 1, max_inter = 3 =
max_phrase - 2). We ignore suffixes when searching for
phrases that match a given pair. These phrases reflect the
semantic relations between the words in each pair.
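A sketch of the phrase search, assuming a hypothetical query interface find_phrases(first, last, min_inter, max_inter); the suffix-insensitive matching mentioned above is left to that interface:

```python
# Step 3 sketch. `find_phrases(first, last, min_inter, max_inter)` is a
# hypothetical query interface returning corpus phrases (as word lists) that
# begin with `first` and end with `last`, with the given number of
# intervening words; suffix-insensitive matching is left to that interface.
MIN_INTER, MAX_INTER = 1, 3   # MAX_INTER = MAX_PHRASE - 2

def phrases_for_pair(a, b, find_phrases,
                     min_inter=MIN_INTER, max_inter=MAX_INTER):
    """Collect phrases for the pair in both orders: A ... B and B ... A."""
    return (find_phrases(a, b, min_inter, max_inter)
            + find_phrases(b, a, min_inter, max_inter))
```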
4. Find patterns: For each phrase found in the previous
step, build patterns from the intervening words. A pattern is
constructed by replacing any or all or none of the interven-
ing words with wild cards (one wild card can only replace
one word). For each pattern, count the number of pairs
(originals and alternates) with phrases that match the pattern
(a wild card must match exactly one word). Keep the top
num_patterns most frequent patterns and discard the rest (we
use num_patterns = 4,000). Typically there will be millions
of patterns, so it is not feasible to keep them all.
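A sketch of pattern generation and selection; phrases_by_pair is an illustrative name for a table mapping each pair to the lists of intervening words found in step 3:

```python
# Step 4 sketch: enumerate every pattern obtained by replacing any subset of
# a phrase's intervening words with a wild card "*" (each wild card stands
# for exactly one word), then count, for each pattern, how many pairs have
# at least one phrase matching it.
from collections import Counter
from itertools import product

def patterns_from_intervening(words):
    """All 2**len(words) patterns: each intervening word kept or replaced by '*'."""
    options = [(w, "*") for w in words]
    return {" ".join(choice) for choice in product(*options)}

def top_patterns(phrases_by_pair, num_patterns=4000):
    """phrases_by_pair maps each pair to its lists of intervening words (step 3)."""
    counts = Counter()
    for intervening_lists in phrases_by_pair.values():
        seen = set()
        for intervening in intervening_lists:
            seen |= patterns_from_intervening(intervening)
        counts.update(seen)          # each pattern counted once per pair
    return [p for p, _ in counts.most_common(num_patterns)]
```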
5. Map pairs to rows: In preparation for building a matrix
X, suitable for SVD, create a mapping of word pairs to row
numbers. For each pair A:B, create a row for A:B and an-
other row for B:A. This will make the matrix more symmet-
rical, reflecting our knowledge that the relational similarity
between A:B and C:D should be the same as the relational
similarity between B:A and D:C. (Mason is to stone as car-
penter is to wood. Stone is to mason as wood is to carpen-
ter.) The intent is to assist SVD by enforcing this symmetry
in the matrix.
6. Map patterns to columns: Create a mapping of the top
num_patterns patterns to column numbers. For each pattern
P, create a column for “word1 P word2” and another column
for “word2 P word1”. Thus there will be 2 × num_patterns
columns in X.
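A sketch of the two mappings in steps 5 and 6; the function and argument names are illustrative, not part of LRA:

```python
# Steps 5 and 6 sketch: rows are indexed by both orderings of every pair
# (originals and filtered alternates); columns by both orientations of every
# retained pattern.
def build_indices(pairs, patterns):
    """Return (row_index, col_index) dicts mapping keys to matrix positions."""
    row_index = {}
    for a, b in pairs:
        row_index[(a, b)] = len(row_index)
        row_index[(b, a)] = len(row_index)
    col_index = {}
    for p in patterns:
        col_index[("word1", p, "word2")] = len(col_index)
        col_index[("word2", p, "word1")] = len(col_index)
    return row_index, col_index
```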
7. Generate a sparse matrix: Generate a matrix X in
sparse matrix format. The value for the cell in row i and
column j is the frequency of the j-th pattern (see step 6) in
phrases that contain the i-th word pair (see step 5).
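A sketch of the matrix construction using SciPy's sparse format (an implementation choice made here, not specified by LRA); pattern_frequency is a hypothetical helper that counts how often a pattern occurs in the phrases containing a pair:

```python
# Step 7 sketch. The double loop is naive and only for clarity; a real
# implementation would visit only the nonzero cells.
from scipy.sparse import lil_matrix

def build_matrix(row_index, col_index, pattern_frequency):
    X = lil_matrix((len(row_index), len(col_index)))
    for pair, i in row_index.items():
        for pattern_key, j in col_index.items():
            freq = pattern_frequency(pair, pattern_key)
            if freq:
                X[i, j] = freq
    return X.tocsr()
```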
8. Calculate entropy: Apply log and entropy transforma-
tions to the sparse matrix [Landauer and Dumais, 1997].
Each cell is replaced with its logarithm, multiplied by a
weight based on the negative entropy of the corresponding
column vector in the matrix. This gives more weight to pat-
terns that vary substantially in frequency for each pair.
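One common formulation of the log and entropy transformation is sketched below; the precise weighting formula is an assumption here, not quoted from the paper:

```python
# Step 8 sketch: log-entropy weighting. For clarity it operates on a dense
# array (e.g. X.toarray()).
import numpy as np

def log_entropy(X_dense):
    """Replace each cell by log(x + 1) times a weight based on the negative
    entropy of its column; peaked (low-entropy) columns get weight near 1."""
    n_rows = X_dense.shape[0]
    col_sums = X_dense.sum(axis=0)
    col_sums[col_sums == 0] = 1.0                       # avoid division by zero
    P = X_dense / col_sums                              # p_ij within each column
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    weights = 1.0 + plogp.sum(axis=0) / np.log(n_rows)  # 1 + (neg. entropy)/log n
    return np.log(X_dense + 1.0) * weights
```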
9. Apply SVD: After the log and entropy transformations, apply SVD to X. SVD decomposes X into a product of three matrices UΣV^T, where U and V are in column orthonormal form (i.e., the columns are orthogonal and have unit length) and Σ is a diagonal matrix of singular values (hence SVD) [Golub and Van Loan, 1996]. If X is of rank r, then Σ is also of rank r. Let Σ_k, where k < r, be the diagonal matrix formed from the top k singular values, and let U_k and V_k be the matrices produced by selecting the corresponding columns from U and V. The matrix U_k Σ_k V_k^T is the matrix of rank k that best approximates the original matrix X, in the sense that it minimizes the approximation errors [Golub and Van Loan, 1996]. We may think of U_k Σ_k V_k^T as a "smoothed" or "compressed" version of the original matrix. SVD is used to reduce noise and compensate for sparseness.
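A sketch of the truncated SVD using SciPy's sparse solver (an illustrative substitute for the SVD implementation referred to at the start of this section):

```python
# Step 9 sketch. k must be smaller than both dimensions of X.
import numpy as np
from scipy.sparse.linalg import svds

def truncated_svd(X_sparse, k=300):
    U_k, sigma_k, Vt_k = svds(X_sparse, k=k)   # k largest singular triplets
    order = np.argsort(sigma_k)[::-1]          # svds returns ascending order
    return U_k[:, order], sigma_k[order], Vt_k[order, :]
```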
10. Projection: Calculate U_kΣ_k (we use k = 300, as recommended by Landauer and Dumais [1997]). This matrix has the same number of rows as X, but only k columns (instead of 2 × num_patterns columns; in our experiments, that is 300 columns instead of 8,000). We do not use V, because we want to calculate the cosines between row vectors, and it can be proven that the cosine between any two row vectors in U_kΣ_k is the same as the cosine between the corresponding two row vectors in U_kΣ_kV_k^T.
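A sketch of the projection and the cosine comparison; the equality of the cosines holds because V_k has orthonormal columns, so (U_kΣ_k)(U_kΣ_k)^T = (U_kΣ_kV_k^T)(U_kΣ_kV_k^T)^T:

```python
# Step 10 sketch: project to k dimensions and compare row vectors by cosine.
import numpy as np

def project(U_k, sigma_k):
    """Rows of the result are the k-dimensional pair vectors of U_k Sigma_k."""
    return U_k * sigma_k                       # scales column j by sigma_k[j]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```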
11. Evaluate alternates: Let A:B and C:D be any two word
pairs in the input set. From step 2, we have (num_filter + 1)
versions of A:B, the original and num_filter alternates. Like-
wise, we have (num_filter + 1) versions of C:D. Therefore
we have (num_filter + 1)^2 ways to compare a version of A:B with a version of C:D. Look for the row vectors in U_kΣ_k that correspond to the versions of A:B and the versions of C:D and calculate the (num_filter + 1)^2 cosines (in our experiments, there are 16 cosines).
12. Calculate relational similarity: The relational similar-
ity between A:B and C:D is the average of the cosines,
among the (num_filter + 1)^2 cosines from step 11, that are
greater than or equal to the cosine of the original pairs, A:B
and C:D. The requirement that the cosine must be greater
than or equal to the original cosine is a way of filtering out
poor analogies, which may be introduced in step 1 and may
have slipped through the filtering in step 2. Averaging the
cosines, as opposed to taking their maximum, is intended to
provide some resistance to noise.
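Steps 11 and 12 can be sketched together; versions and row_of are hypothetical lookup tables giving, respectively, the (num_filter + 1) versions of each input pair (original first) and the row of each pair in the projected matrix from step 10:

```python
# Steps 11 and 12 sketch.
import numpy as np

def _cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def relational_similarity(ab, cd, versions, row_of, projected):
    """Average the version cosines that are >= the cosine of the original pairs."""
    ab_versions, cd_versions = versions[ab], versions[cd]
    def vec(pair):
        return projected[row_of[pair]]
    original = _cosine(vec(ab_versions[0]), vec(cd_versions[0]))
    cosines = [_cosine(vec(p), vec(q)) for p in ab_versions for q in cd_versions]
    kept = [c for c in cosines if c >= original]   # filter out poor analogies
    return sum(kept) / len(kept)                   # non-empty: contains the original
```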
In our experiments, the input set contains from 600 to
2,244 word pairs. Steps 11 and 12 can be repeated for each pair of input word pairs that is to be compared.
In the following experiments, we use a local copy of the
Waterloo MultiText System (WMTS) as the search engine,
with a corpus of about 5 × 10^10 English words. The corpus