Like many problems in Natural Language Processing, word sense disambiguation is difficult, yet a solution would be very useful. Automatically determining the intended meaning of a word with multiple definitions could benefit document classification, keyword searching, OCR, and many other applications that process text. Unfortunately, it is a challenge to design a system that can accurately cope with the idiosyncrasies of human language.
In this report we describe our attempts to improve the discrimination accuracy of the Yarowsky word sense disambiguation algorithm. The first of these experiments used an iterative approach to re-train the classifier. Our hope was that a corpus labeled by an imperfect classifier would make training material superior to an unlabeled corpus. By using the classifier's output from one iteration as its training input in the next, we tried to boost the accuracy of each successive cycle.
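The re-training cycle described above can be sketched roughly as follows. This is only an illustration of the loop structure: the co-occurrence counter stands in for a real classifier and is not the Yarowsky algorithm, and the names `train`, `classify`, and `retrain`, along with the seed data format, are assumptions made for this example rather than anything taken from the report.

```python
from collections import Counter, defaultdict

def train(labeled):
    """Count how often each context word co-occurs with each sense label."""
    counts = defaultdict(Counter)
    for context, sense in labeled:
        for word in context:
            counts[word][sense] += 1
    return counts

def classify(counts, context, senses):
    """Pick the sense with the most votes; ties go to the first sense listed."""
    votes = Counter()
    for word in context:
        votes.update(counts.get(word, {}))
    return max(senses, key=lambda s: votes[s])

def retrain(seed, contexts, senses, iterations=3):
    """Label the corpus with the current model, then train again on those labels."""
    model = train(seed)
    for _ in range(iterations):
        labeled = [(c, classify(model, c, senses)) for c in contexts]
        # The seed examples are kept so later cycles cannot drift away from them.
        model = train(seed + labeled)
    return model

# Hypothetical toy data: two senses of "bank".
senses = ["bank/shore", "bank/finance"]
seed = [(["river", "water"], "bank/shore"), (["money", "loan"], "bank/finance")]
contexts = [["water", "muddy"], ["loan", "interest"]]
model = retrain(seed, contexts, senses)
```

In this toy run the word "interest" never appears in the seed data, but after one cycle it is associated with the finance sense because it co-occurred with "loan" in a context the first model labeled. The same mechanism can also reinforce early mistakes, which is the risk an iterative scheme of this kind has to manage.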
Our second experiment used part-of-speech information as an additional knowledge source for the Yarowsky algorithm. We pre-processed our training and test corpora with a part-of-speech tagger and used these tags to filter possible senses and improve the predictive power of words' contexts. Since part-of-speech tagging is a relatively mature technology with high accuracy, we expected it to improve the accuracy of the much more difficult word sense disambiguation process.
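As a minimal sketch of the two uses of part-of-speech tags mentioned above, the functions below filter a word's candidate senses by tag and attach tags to context words. The sense records, the tag names, and the helpers `filter_senses` and `tag_context` are hypothetical, and the report's actual data structures and tagger may differ.

```python
def filter_senses(senses, pos_tag):
    """Keep only the senses whose part of speech matches the tagger's output."""
    matching = [s for s in senses if s["pos"] == pos_tag]
    # Fall back to every sense if the tag rules them all out (e.g. a tagger error).
    return matching or senses

def tag_context(tagged_tokens):
    """Turn (word, tag) pairs into 'word/TAG' features so that the same
    word used with different parts of speech is counted separately."""
    return [f"{word.lower()}/{tag}" for word, tag in tagged_tokens]

# Hypothetical dictionary entries for "plant".
plant_senses = [
    {"id": "plant-noun-1", "pos": "NOUN"},
    {"id": "plant-verb-1", "pos": "VERB"},
]
verb_only = filter_senses(plant_senses, "VERB")
features = tag_context([("Plant", "VERB"), ("trees", "NOUN")])
```

The fallback in `filter_senses` is a design choice for this sketch: it keeps a tagging error from leaving a word with no candidate senses at all.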
The third experiment modified the training phase of the Yarowsky algorithm by replacing its assumption that a word's senses are uniformly distributed with a more realistic one. We exploited the fact that our dictionary lists senses roughly in order of frequency of use to create a distribution that allows more accurate training.
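To illustrate the difference between the two assumptions, the sketch below compares a uniform prior with one that decreases with dictionary rank. The abstract does not say which distribution the report used, so the 1/rank weighting here is purely an assumption for illustration, as are the function names.

```python
def uniform_prior(num_senses):
    """Every sense is equally likely: the baseline assumption."""
    return [1.0 / num_senses] * num_senses

def rank_prior(num_senses):
    """Weight the sense listed at position r by 1/r, then normalize.
    The 1/r form is an assumed example, not the report's distribution."""
    weights = [1.0 / rank for rank in range(1, num_senses + 1)]
    total = sum(weights)
    return [w / total for w in weights]

flat = uniform_prior(3)
ranked = rank_prior(3)
```

For a word with three senses, the rank-based prior gives the first-listed sense more than half of the probability mass, whereas the uniform prior gives each sense one third.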
Title
Experiments in Improving Unsupervised Word Sense Disambiguation
Published
2003-02-12
Full Collection Name
Electrical Engineering & Computer Sciences Technical Reports
Other Identifiers
CSD-03-1227
Type
Text
Extent
39 p
Archive
The Engineering Library
Usage Statement
Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).