Experiments in Improving Unsupervised Word Sense Disambiguation

Wilensky, Robert; Computer Science Division; Traupman, Jonathan

Description

Like many problems in natural language processing, word sense disambiguation is difficult yet potentially very useful. Automatically determining the meanings of words with multiple definitions could benefit document classification, keyword searching, OCR, and many other applications that process text. Unfortunately, it is a challenge to design a system that can accurately cope with the idiosyncrasies of human language.

In this report we describe our attempts to improve the discrimination accuracy of the Yarowsky word sense disambiguation algorithm. The first of these experiments used an iterative approach to re-train the classifier. Our hope was that a corpus labeled by an imperfect classifier would be better training material than an unlabeled corpus. By using the classifier's output from one iteration as its training input for the next, we tried to boost the accuracy of each successive cycle.
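The iterative idea can be illustrated with a minimal sketch. It is not the Yarowsky algorithm or the report's implementation: the word-count classifier, its crude smoothed score, and the names `train`, `classify`, and `self_train` are all assumptions made for illustration. The sketch only shows the loop structure, in which the labels produced in one cycle become training data for the next.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Count context words per sense. `labeled` is a list of (context_words, sense)."""
    counts = defaultdict(Counter)
    for context, sense in labeled:
        counts[sense].update(context)
    return counts


def classify(counts, context):
    """Pick the sense whose training contexts best match `context` (toy scoring)."""
    def score(sense):
        total = sum(counts[sense].values())
        # Add-one style smoothing so unseen words do not zero out a sense.
        return sum((counts[sense][w] + 1) / (total + 1) for w in context)
    # Sorting makes tie-breaking deterministic.
    return max(sorted(counts), key=score)


def self_train(seed_labeled, unlabeled, iterations=3):
    """Re-train repeatedly on the classifier's own labels (hypothetical helper)."""
    counts = train(seed_labeled)
    for _ in range(iterations):
        # Label the raw corpus with the current classifier...
        relabeled = [(ctx, classify(counts, ctx)) for ctx in unlabeled]
        # ...then train the next classifier on seed plus machine-labeled data.
        counts = train(seed_labeled + relabeled)
    return counts


seed = [(["river", "water"], "bank/shore"), (["money", "loan"], "bank/finance")]
raw = [["water", "fishing"], ["loan", "interest"], ["fishing", "boat"]]
model = self_train(seed, raw)
print(classify(model, ["boat"]))
```

In this toy data the word "boat" never appears in the seed examples, so any evidence about it can only come from the machine-labeled contexts added in later cycles.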

Our second experiment used part-of-speech information as an additional knowledge source for the Yarowsky algorithm. We pre-processed our training and test corpora with a part-of-speech tagger and used these tags to filter possible senses and improve the predictive power of words' contexts. Since part-of-speech tagging is a relatively mature technology with high accuracy, we expected it to improve the accuracy of the much more difficult word sense disambiguation process.
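As a rough sketch of the two uses of part-of-speech tags mentioned above, the snippet below filters a word's candidate senses by its tag and attaches tags to context words. The sense inventory `SENSES`, the tag names, the function names, and the fallback to all senses when no sense matches the tag are hypothetical choices for illustration, not details taken from the report.

```python
# Hypothetical sense inventory: word -> list of (sense label, part of speech).
SENSES = {
    "bank": [
        ("bank/finance", "NOUN"),
        ("bank/shore", "NOUN"),
        ("bank/tilt", "VERB"),
    ],
}


def candidate_senses(word, pos_tag, inventory=SENSES):
    """Keep only the senses whose part of speech matches the tagger's output."""
    entries = inventory.get(word, [])
    matching = [sense for sense, pos in entries if pos == pos_tag]
    # Assumption: if the tag matches no listed sense (e.g. a tagger error),
    # fall back to the full sense list rather than returning nothing.
    return matching or [sense for sense, _ in entries]


def context_features(tagged_context):
    """Turn (word, tag) pairs into features that distinguish, say, plane/NOUN from plane/VERB."""
    return [f"{word}/{tag}" for word, tag in tagged_context]


print(candidate_senses("bank", "VERB"))
print(context_features([("river", "NOUN"), ("flows", "VERB")]))
```

Filtering shrinks the set of senses the disambiguator has to choose among, while tagged context features keep apart homographs that would otherwise be counted as the same context word.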

The third experiment modified the training phase of the Yarowsky algorithm by replacing its assumption that a word's senses are uniformly distributed with a more realistic one. We exploited the fact that our dictionary lists senses roughly in order of frequency of use to create a distribution that allows more accurate training.
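The abstract does not say which distribution was used, so the following is only one plausible sketch: it assumes weights proportional to 1/rank, where rank is a sense's position in the dictionary entry. The function names and the `exponent` parameter are hypothetical.

```python
def uniform_prior(senses):
    """Baseline assumption: every sense of a word is equally likely."""
    return {sense: 1.0 / len(senses) for sense in senses}


def rank_prior(senses, exponent=1.0):
    """Weight each sense by 1 / rank**exponent, then normalize to sum to 1.

    `senses` must be in dictionary order (most frequent first).
    The 1/rank shape is an illustrative assumption, not the report's choice.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, len(senses) + 1)]
    total = sum(weights)
    return {sense: w / total for sense, w in zip(senses, weights)}


ordered = ["bank/finance", "bank/shore", "bank/tilt"]
print(uniform_prior(ordered))
print(rank_prior(ordered))
```

Setting `exponent` to 0 recovers the uniform distribution, and larger values concentrate more probability on the first-listed sense.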

Details

Creator

Wilensky, Robert, Author
Computer Science Division, Publisher
Traupman, Jonathan, Author

Published

2003-02-12

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

CSD-03-1227

Type

Text

Format

technical reports

Extent

39 p

Archive

The Engineering Library

Collection

EECS Technical Reports
