Measuring Semantic Similarity by Latent Relational Analysis

tion answering, machine translation using parallel corpora,
information extraction, word sense disambiguation, text
summarization, measuring lexical cohesion, identifying
sentiment and affect in text, and many other tasks in natural
language processing. This is the vision that motivates
research in paraphrasing [Barzilay and McKeown, 2001] and
textual entailment [Dagan and Glickman, 2004], two topics
that have lately attracted much interest.

In the absence of such a black box, current approaches to
these problems typically use measures of attributional
similarity. For example, the standard bag-of-words approach to
information retrieval is based on attributional similarity
[Salton and McGill, 1983]. Given a query, a search engine
produces a ranked list of documents, where the rank of a
document depends on the attributional similarity of the
document to the query. The attributes are based on word
frequencies; relations between words are ignored.

Although attributional similarity measures are very use-
ful, we believe that they are limited and should be supple-
mented by relational similarity measures. Cognitive psy-
chologists have also argued that human similarity judge-
ments involve both attributional and relational similarity
[Medin et al., 1990].

Consider word sense disambiguation, for example. In
isolation, the word “plant” could refer to an industrial plant or
a living organism. Suppose the word “plant” appears in
some text near the word “food”. A typical approach to
disambiguating “plant” would compare the attributional
similarity of “food” and “industrial plant” to the attributional
similarity of “food” and “living organism” [Lesk, 1986;
Banerjee and Pedersen, 2003]. In this case, the decision may
not be clear, since industrial plants often produce food and
living organisms often serve as food. It would be very
helpful to know the relation between “food” and “plant” in this
example. In the text “food for the plant”, the relation
between food and plant strongly suggests that the plant is a
living organism, since industrial plants do not need food. In
the text “food at the plant”, the relation strongly suggests
that the plant is an industrial plant, since living organisms
are not usually considered locations.

A measure of relational similarity could potentially
improve the performance of any text processing application
that currently uses a measure of attributional similarity. We
believe relational similarity is the next step, after
attributional similarity, towards the black box envisioned above.

3 Related Work

Let R₁ be the semantic relation between a pair of words, A
and B, and let R₂ be the semantic relation between another
pair, C and D. We wish to measure the relational similarity
between R₁ and R₂. The relations R₁ and R₂ are not given to
us; our task is to infer these hidden (latent) relations and
then compare them.

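In the VSM approach of Turney and Littman [2005], we
create vectors, r₁ and r₂, that represent features of R₁ and R₂,
and measure the similarity of R₁ and R₂ by the cosine of the
angle θ between $r_1 = (r_{1,1}, \ldots, r_{1,n})$ and
$r_2 = (r_{2,1}, \ldots, r_{2,n})$:

$$
\mathrm{cosine}(\theta)
  = \frac{\sum_{i=1}^{n} r_{1,i} \, r_{2,i}}
         {\sqrt{\sum_{i=1}^{n} (r_{1,i})^2 \cdot \sum_{i=1}^{n} (r_{2,i})^2}}
  = \frac{r_1 \cdot r_2}{\|r_1\| \, \|r_2\|}
$$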
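The cosine measure can be sketched in a few lines of Python. This is only an illustration of the formula, not the authors' implementation. The function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made here.

```python
import math


def cosine(r1, r2):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(r1) != len(r2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(r1, r2))
    norm1 = math.sqrt(sum(a * a for a in r1))
    norm2 = math.sqrt(sum(b * b for b in r2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm1 * norm2)
```

Vectors that point in the same direction give a cosine of 1, and orthogonal vectors give 0, regardless of vector length.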

We make a vector, r, to characterize the relationship
between two words, X and Y, by counting the frequencies of
various short phrases containing X and Y. Turney and
Littman [2005] use a list of 64 joining terms, such as “of”,
“for”, and “to”, to form 128 phrases that contain X and Y,
such as “X of Y”, “Y of X”, “X for Y”, “Y for X”, “X to Y”,
and “Y to X”. These phrases are then used as queries for a
search engine and the number of hits (matching documents)
is recorded for each query. This process yields a vector of
128 numbers. If the number of hits for a query is x, then the
corresponding element in the vector r is log(x + 1).
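The vector construction described above might look roughly like the following sketch. The three joining terms shown are the ones named in the text (the full list of 64 is not reproduced here). The `hit_count` callable is a hypothetical stand-in for a search engine query, and all function names are illustrative.

```python
import math

# Illustrative subset only; the list used in the paper has 64 joining terms.
JOINING_TERMS = ["of", "for", "to"]


def make_phrases(x, y, joining_terms=JOINING_TERMS):
    """Build two phrases per joining term: 'X term Y' and 'Y term X'."""
    phrases = []
    for term in joining_terms:
        phrases.append(f"{x} {term} {y}")
        phrases.append(f"{y} {term} {x}")
    return phrases


def relation_vector(x, y, hit_count, joining_terms=JOINING_TERMS):
    """Map each phrase to log(hits + 1).

    hit_count is a hypothetical callable (phrase -> number of matching
    documents); a real system would query a search engine here.
    """
    phrases = make_phrases(x, y, joining_terms)
    return [math.log(hit_count(p) + 1) for p in phrases]
```

With 64 joining terms, `make_phrases` yields 128 phrases, so `relation_vector` returns a vector of 128 numbers. Adding 1 before taking the logarithm keeps the element defined (and equal to zero) when a query has no hits.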

Turney and Littman [2005] evaluated the VSM approach
by its performance on 374 college-level multiple-choice
SAT analogy questions, achieving a score of 47%. An SAT
analogy question consists of a target word pair, called the
stem, and five choice word pairs. To answer an analogy
question, vectors are created for the stem pair and each
choice pair, and then cosines are calculated for the angles
between the stem vector and each choice vector. The best
guess is the choice pair with the highest cosine. We use the
same set of analogy questions to evaluate LRA in Section 5.
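The answering procedure can be illustrated with a short sketch, assuming the stem and choice vectors have already been built. The function names are hypothetical, and returning the first choice on a tie is an assumption, since the text does not say how ties are resolved.

```python
import math


def cosine(r1, r2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norms = math.sqrt(sum(a * a for a in r1)) * math.sqrt(sum(b * b for b in r2))
    # Assumption: an all-zero vector is treated as having no similarity.
    return dot / norms if norms else 0.0


def answer_analogy(stem_vector, choice_vectors):
    """Return the index of the choice with the highest cosine to the stem.

    Ties go to the earliest choice (an assumption of this sketch).
    """
    scores = [cosine(stem_vector, c) for c in choice_vectors]
    return max(range(len(scores)), key=scores.__getitem__)
```

A real question would supply one stem vector and five choice vectors of 128 elements each; the function works for any number of choices.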

The best previous performance on the SAT questions was
achieved by combining thirteen separate modules [Turney et
al., 2003]. The performance of LRA significantly surpasses
this combined system, but there is no real contest between
these approaches, because we can simply add LRA to the
combination, as a fourteenth module. Since the VSM
module had the best performance of the thirteen modules
[Turney et al., 2003], the following experiments focus on
comparing VSM and LRA.

The VSM was also evaluated by its performance as a
distance measure in a supervised nearest neighbour classifier
for noun-modifier semantic relations [Turney and Littman,
2005]. The problem is to classify a noun-modifier pair, such
as “laser printer”, according to the semantic relation
between the head noun (printer) and the modifier (laser). The
evaluation used 600 noun-modifier pairs that have been
manually labeled with 30 classes of semantic relations
[Nastase and Szpakowicz, 2003]. For example, “laser printer” is
classified as instrument; the printer uses the laser as an
instrument for printing. A testing pair is classified by
searching for its single nearest neighbour in the labeled training
data. The best guess is the label for the training pair with the
highest cosine; that is, the training pair that is most
analogous to the testing pair, according to VSM. LRA is
evaluated with the same set of noun-modifier pairs in Section 6.
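The single nearest neighbour step can be sketched as follows, assuming each training pair has already been turned into a vector. The function names and the data layout (a list of vector and label tuples) are assumptions of this sketch, not details from the paper.

```python
import math


def cosine(r1, r2):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(r1, r2))
    norms = math.sqrt(sum(a * a for a in r1)) * math.sqrt(sum(b * b for b in r2))
    # Assumption: an all-zero vector is treated as having no similarity.
    return dot / norms if norms else 0.0


def classify(test_vector, training):
    """Return the label of the training pair with the highest cosine.

    training is a non-empty list of (vector, label) tuples.
    """
    _, best_label = max(training, key=lambda pair: cosine(test_vector, pair[0]))
    return best_label
```

Because only the single closest training pair decides the label, the quality of the classifier depends directly on how well the cosine ranks analogous pairs.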

4 Latent Relational Analysis

LRA takes as input a set of word pairs and produces as
output a measure of the relational similarity between any two
of the input pairs. LRA relies on three resources: (1) a
search engine with a very large corpus of text, (2) a
broad-coverage thesaurus of synonyms, and (3) an efficient im-
