Topic Characterization of Full Length Texts Using Direct and Indirect Term Evidence

Computer Science Division; Fisher, David E.


Description

This project evaluates two families of algorithms that can be used to automatically classify general texts within a set of conceptual categories. The first family uses indirect evidence in the form of term-category co-occurrence data. The second uses direct evidence based on the senses of the terms, where a term's senses are designated by the thesaurus categories to which it belongs. The direct evidence algorithms incorporate varying degrees of indirect evidence as well.
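The distinction between the two kinds of evidence can be illustrated with a minimal Python sketch. It is not the report's implementation: the toy thesaurus, the function names (direct_scores, train_cooccurrence, indirect_scores), and the simple vote-splitting scheme are all assumptions made for illustration, and no sense disambiguation is attempted.

```python
from collections import Counter, defaultdict

# Hypothetical toy thesaurus: each term maps to the categories (senses) it belongs to.
THESAURUS = {
    "bank": {"finance", "geography"},
    "loan": {"finance"},
    "river": {"geography"},
    "interest": {"finance", "cognition"},
}


def direct_scores(tokens, thesaurus):
    """Direct evidence: each term votes for the categories it is a member of."""
    scores = Counter()
    for tok in tokens:
        senses = thesaurus.get(tok, ())
        for cat in senses:
            # Assumption: split the vote evenly across a term's senses.
            scores[cat] += 1.0 / len(senses)
    return scores


def train_cooccurrence(labeled_docs):
    """Count how often each term occurs in texts associated with each category."""
    table = defaultdict(Counter)
    for tokens, cats in labeled_docs:
        for tok in tokens:
            for cat in cats:
                table[tok][cat] += 1
    return table


def indirect_scores(tokens, table):
    """Indirect evidence: each term votes in proportion to its co-occurrence counts."""
    scores = Counter()
    for tok in tokens:
        counts = table.get(tok)
        if not counts:
            continue  # term never seen in the training data
        total = sum(counts.values())
        for cat, n in counts.items():
            scores[cat] += n / total
    return scores
```

In this sketch a term such as "rate", which has no thesaurus entry, contributes nothing to the direct scores but can still contribute indirect evidence if it co-occurred with a category in the training texts.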

For these experiments, a set of 3,864 conceptual categories was derived from the noun hierarchy of WordNet, an on-line thesaurus. The co-occurrence data for the associational and disambiguation algorithms was collected from a corpus of 3,711 AP newswire articles, comprising approximately 1.7 million words of text. Each of the algorithms was applied to all of the articles in the AP corpus, with their performance evaluated both qualitatively and quantitatively.

The results of these experiments show that both classes of algorithms have potential as fully automatic text classifiers. The direct methods produce qualitatively better classifications than the indirect ones when applied to AP newswire texts. The direct methods also achieve both higher precision (86.75% correctly classified in the best case, versus 72.34%) and higher approximate recall.

The experiments identify limiting factors on the performance of the algorithms. The primary limitations stem from the quality of the thesaural categories, which were derived automatically, and from the performance of the term sense disambiguation algorithm. The former can be addressed with human intervention; the latter, with a larger training set for the statistical database.

Details

Title

Topic Characterization of Full Length Texts Using Direct and Indirect Term Evidence

Creator

Computer Science Division, Publisher
Fisher, David E., Author

Published

1994-05-01

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

CSD-94-809

Type

Text

Format

technical reports

Extent

33 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
