SATZ - An Adaptive Sentence Segmentation System

Computer Science Division; Palmer, David D.

Description

The segmentation of a text into sentences is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. This is a non-trivial task, however, since end-of-sentence punctuation marks are ambiguous. A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. To disambiguate punctuation marks, most systems use brittle, special-purpose regular expression grammars and exception rules. Such approaches are usually limited to the text genre for which they were developed and cannot be easily adapted to new text types or to other natural languages.
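The ambiguity of the period can be seen with a deliberately naive splitter. The following is an illustrative sketch only, not code from the report; the function name `naive_split` and the sample text are assumptions made for this example.

```python
import re


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace.

    Hypothetical baseline for illustration: it treats every such mark
    as a sentence boundary and knows nothing about abbreviations.
    """
    return [piece for piece in re.split(r"(?<=[.!?])\s+", text) if piece]


# The sample contains two sentences.
sample = "Dr. Smith paid $3.50 for the book. He left at 5 p.m. on Friday."
pieces = naive_split(sample)

for piece in pieces:
    print(repr(piece))
# 'Dr.'
# 'Smith paid $3.50 for the book.'
# 'He left at 5 p.m.'
# 'on Friday.'
```

The decimal point in "$3.50" survives only because no whitespace follows it, while the abbreviations "Dr." and "p.m." each produce a false boundary. Patching such cases with exception rules is the kind of genre-specific tuning the abstract calls brittle.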

As an alternative, I present an efficient, trainable algorithm that can be easily adapted to new text genres and to a range of natural languages. The algorithm uses a lexicon with part-of-speech probabilities and a feed-forward neural network for rapid training. The method described requires minimal storage overhead and a very small amount of training data. The algorithm overcomes the limitations of existing methods and achieves very high accuracy.
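The general idea of combining a part-of-speech lexicon with a feed-forward network can be sketched as follows. This is a toy illustration under stated assumptions, not the SATZ implementation: the lexicon entries, the three categories, the one-token context on each side of the period, the network size, and the training examples are all invented for this example; the abstract does not specify them.

```python
import math
import random

# Hypothetical lexicon: word -> (P(noun), P(abbreviation), P(other)).
LEXICON = {
    "dr": (0.0, 1.0, 0.0), "mr": (0.0, 1.0, 0.0),
    "smith": (1.0, 0.0, 0.0), "jones": (1.0, 0.0, 0.0),
    "book": (1.0, 0.0, 0.0), "shop": (1.0, 0.0, 0.0),
    "he": (0.0, 0.0, 1.0), "the": (0.0, 0.0, 1.0), "it": (0.0, 0.0, 1.0),
}
UNKNOWN = (1 / 3, 1 / 3, 1 / 3)  # assumed fallback for unlisted words


def features(before, after):
    """Concatenate the lexicon probabilities of the tokens around a period."""
    def look(tok):
        return LEXICON.get(tok.lower().rstrip("."), UNKNOWN)
    return list(look(before) + look(after))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """One hidden layer, one sigmoid output; the last weight in each row is a bias."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, x)))
             for w in self.w1]
        y = sigmoid(self.w2[-1] + sum(wi * hi for wi, hi in zip(self.w2, h)))
        return h, y

    def train_step(self, x, target, lr=0.5):
        """One stochastic gradient step on the cross-entropy loss."""
        h, y = self.forward(x)
        d_out = y - target
        for j, hj in enumerate(h):
            d_hid = d_out * self.w2[j] * hj * (1.0 - hj)
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d_hid * xi
            self.w1[j][-1] -= lr * d_hid
            self.w2[j] -= lr * d_out * hj
        self.w2[-1] -= lr * d_out


def mean_loss(net, data):
    total = 0.0
    for x, t in data:
        _, y = net.forward(x)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total / len(data)


# Invented training pairs: (token ending in '.', next token, 1 if boundary).
EXAMPLES = [("dr.", "smith", 0), ("mr.", "jones", 0), ("dr.", "jones", 0),
            ("book.", "he", 1), ("shop.", "the", 1), ("jones.", "it", 1)]
DATA = [(features(b, a), label) for b, a, label in EXAMPLES]

net = TinyNet(n_in=6, n_hidden=3)
loss_before = mean_loss(net, DATA)
for _ in range(200):
    for x, t in DATA:
        net.train_step(x, t)
loss_after = mean_loss(net, DATA)

_, p_boundary = net.forward(features("shop.", "he"))
print(f"loss {loss_before:.3f} -> {loss_after:.3f}; P(boundary) = {p_boundary:.3f}")
```

Because the input is built from lexicon probabilities rather than the words themselves, a sketch like this would be moved to another genre or language by swapping the lexicon and retraining, which matches the adaptability the abstract claims in spirit if not in detail.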

The results presented demonstrate the successful implementation of the algorithm on a 27,294-sentence English corpus. Training time was less than one minute on a workstation, and the method correctly labeled over 98.5% of the sentence boundaries. The method was also successful in labeling texts containing no capital letters. The system has been successfully adapted to German and French. The training times were similarly low, and the resulting accuracy exceeded 99%.

Details

Title

SATZ - An Adaptive Sentence Segmentation System

Creator

Computer Science Division, Publisher
Palmer, David D., Author

Published

1994-12-01

Full Collection Name

Electrical Engineering & Computer Sciences Technical Reports

Other Identifiers

CSD-94-846

Type

Text

Format

technical reports

Extent

29 p

Archive

The Engineering Library

Usage Statement

Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).

Collection

EECS Technical Reports
