Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities



System                     Fine-Grained Recall   Coarse-Grained Recall
Best Senseval-3 System     72.9%                 79.5%
NRC-Fine                   69.4%                 75.9%
NRC-Fine2                  69.1%                 75.6%
NRC-Coarse                 NA                    75.8%
NRC-Coarse2                NA                    75.7%
Median Senseval-3 System   65.1%                 73.7%
Most Frequent Sense        55.2%                 64.5%

Table 2: Comparison of NRC-Fine with other Senseval-3 ELS systems.

coarse-grained scores of our four entries with other
Senseval-3 systems.

With NRC-Fine and NRC-Coarse, each semantic
feature was scored by calculating its PMI with
the head word, and then low scoring semantic
features were dropped. With NRC-Fine2 and
NRC-Coarse2, the threshold for dropping features was
changed, so that many more features were retained.
The Senseval-3 results suggest that it is better to
drop more features.
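
The PMI-based filtering described above can be illustrated with a minimal sketch. It assumes PMI is estimated from raw co-occurrence counts as log2(p(f, h) / (p(f) p(h))); the function names, the toy counts, and both threshold values are hypothetical and are not the settings used by the NRC systems.

```python
import math

def pmi(joint_count, feature_count, head_count, total):
    """PMI in bits from raw counts; -inf if the pair never co-occurs."""
    if joint_count == 0:
        return float("-inf")
    return math.log2(joint_count * total / (feature_count * head_count))

def keep_features(counts, head_count, total, threshold):
    """Return the features whose PMI with the head word is >= threshold."""
    kept = {}
    for feature, (joint, freq) in counts.items():
        score = pmi(joint, freq, head_count, total)
        if score >= threshold:
            kept[feature] = score
    return kept

# Hypothetical counts: feature -> (co-occurrences with head word, corpus frequency)
toy_counts = {"river": (80, 1000), "money": (40, 4000), "the": (500, 500000)}

# A high threshold drops more features; a low threshold retains more.
strict = keep_features(toy_counts, head_count=2000, total=1_000_000, threshold=3.0)
loose = keep_features(toy_counts, head_count=2000, total=1_000_000, threshold=0.0)
```

In this toy setting the stricter threshold keeps only "river", while the looser one also keeps "money"; the two thresholds play the roles of the NRC-Fine and NRC-Fine2 settings only by analogy.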

NRC-Coarse and NRC-Coarse2 were designed to
maximize the coarse score, by training them with
data in which the senses were relabeled by their
coarse sense equivalence classes. The fine scores
for these two systems are meaningless and should be
ignored. The Senseval-3 results indicate that there
is no advantage to relabeling.
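
As an illustration of the relabeling step, the following sketch maps fine sense labels to coarse equivalence classes before training. The sense identifiers, the mapping, and the feature dictionaries are hypothetical; the actual Senseval-3 sense inventory and the NRC training format are not reproduced here.

```python
def relabel_coarse(examples, fine_to_coarse):
    """Replace each fine sense label with its coarse equivalence class."""
    return [(features, fine_to_coarse[sense]) for features, sense in examples]

# Hypothetical mapping: two fine senses share one coarse class.
fine_to_coarse = {"bank.1a": "bank.1", "bank.1b": "bank.1", "bank.2": "bank.2"}

# Hypothetical training examples: (feature dictionary, fine sense label)
training = [
    ({"river": 1}, "bank.2"),
    ({"loan": 1}, "bank.1a"),
    ({"teller": 1}, "bank.1b"),
]

coarse_training = relabel_coarse(training, fine_to_coarse)
```

A classifier trained on `coarse_training` can only predict coarse classes, which is why fine-grained scores for such a system carry no meaning.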

The NRC systems scored roughly midway
between the best and median systems. This
performance supports the hypothesis that corpus-based
semantic features can be useful for WSD. In future
work, we plan to design a system that combines
corpus-based semantic features with the most
effective elements of the other Senseval-3 systems.
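
The "roughly midway" claim can be checked against the figures in Table 2 with a short calculation. This is only a worked example; the helper name `relative_position` is hypothetical.

```python
def relative_position(score, median, best):
    """Fraction of the way from the median score (0.0) to the best score (1.0)."""
    return (score - median) / (best - median)

# NRC-Fine recall compared with the median and best Senseval-3 systems (Table 2).
fine_position = relative_position(69.4, median=65.1, best=72.9)
coarse_position = relative_position(75.9, median=73.7, best=79.5)

print(f"fine: {fine_position:.2f}, coarse: {coarse_position:.2f}")
```

On the fine-grained scores NRC-Fine sits about 55% of the way from the median to the best system; on the coarse-grained scores it sits about 38% of the way.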

For reasons of computational efficiency, we chose
a relatively narrow window of nine words around
the head word. We intend to investigate whether a
larger window would bring the system performance
up to the level of the best Senseval-3 system.
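
A context window of this kind can be sketched as follows. The sketch assumes that the window is nine tokens in total, centred on the head word (four tokens on each side); the text does not say whether this or nine words per side is meant. The function name and the example sentence are hypothetical.

```python
def context_window(tokens, head_index, size=9):
    """Return up to `size` tokens centred on the head word (assumed layout)."""
    half = size // 2
    start = max(0, head_index - half)
    end = min(len(tokens), head_index + half + 1)
    return tokens[start:end]

sentence = "she sat on the grassy bank of the river and watched the boats".split()
head = sentence.index("bank")

window = context_window(sentence, head)
wider = context_window(sentence, head, size=13)
```

Passing a larger `size` widens the context at the cost of scoring more candidate features per instance, which is the trade-off mentioned above.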

4 Conclusion

This paper has sketched the NRC WSD system for
the ELS task in Senseval-3. Due to space
limitations, many details were omitted, but it is likely that
their impact on the performance is relatively small.

The system design is relatively straightforward
and classical. The most innovative aspect of the
system is the set of semantic features, which are purely
corpus-based; no lexicon was used.

Acknowledgements

We are very grateful to Egidio Terra, Charlie Clarke,
and the School of Computer Science of the
University of Waterloo, for giving us a copy of the
Waterloo MultiText System. Thanks to Diana Inkpen,
Joel Martin, and Mario Jarmasz for helpful
discussions. Thanks to the organizers of Senseval for their
service to the WSD research community. Thanks to
Eric Brill and the developers of Weka, for making
their software available.

References

Eric Brill. 1994. Some advances in transformation-based
part of speech tagging. In Proceedings of the 12th
National Conference on Artificial Intelligence
(AAAI-94), pages 722-727.

Charles L.A. Clarke and Gordon V. Cormack. 2000.
Shortest substring retrieval and ranking. ACM
Transactions on Information Systems (TOIS),
18(1):44-78.

Charles L.A. Clarke, G.V. Cormack, and F.J.
Burkowski. 1995. An algebra for structured text
search and a framework for its implementation.
The Computer Journal, 38(1):43-56.

Egidio L. Terra and Charles L.A. Clarke. 2003.
Frequency estimates for statistical word similarity
measures. In Proceedings of the Human Language
Technology and North American Chapter of
Association of Computational Linguistics Conference
2003 (HLT/NAACL 2003), pages 244-251.

Peter D. Turney. 2001. Mining the Web for synonyms:
PMI-IR versus LSA on TOEFL. In Proceedings of the
Twelfth European Conference on Machine Learning
(ECML-2001), pages 491-502.

Peter D. Turney. 2003. Coherent keyphrase extraction
via Web mining. In Proceedings of the Eighteenth
International Joint Conference on Artificial
Intelligence (IJCAI-03), pages 434-439.

Ian H. Witten and Eibe Frank. 1999. Data Mining:
Practical Machine Learning Tools and Techniques
with Java Implementations. Morgan Kaufmann,
San Mateo, CA.

D. Yarowsky, S. Cucerzan, R. Florian, C. Schafer,
and R. Wicentowski. 2001. The Johns Hopkins
SENSEVAL2 system descriptions. In Proceedings
of SENSEVAL2, pages 163-166.
