Segmenting a text into sentences is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. The task is non-trivial, however, because end-of-sentence punctuation marks are ambiguous. A period, for example, can denote a decimal point, an abbreviation, the end of a sentence, or even an abbreviation at the end of a sentence. To disambiguate punctuation marks, most systems use brittle, special-purpose regular expression grammars and exception rules. Such approaches are usually limited to the text genre for which they were developed and cannot be easily adapted to new text types or to other natural languages.
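The ambiguity of the period can be illustrated with a deliberately naive rule-based splitter. The sketch below is not taken from any existing system; the function name `naive_split`, the single regular expression, and the sample sentence are assumptions chosen for illustration only.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter.

    Illustrative rule only: it stands in for the kind of hand-written
    pattern that rule-based segmenters rely on.
    """
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

# The sample contains an abbreviation ("Dr."), a decimal point ("3.50"),
# and an abbreviation that also ends a sentence ("p.m.").
sample = "Dr. Smith paid $3.50 for coffee. He left at 5 p.m. The shop closed."
sentences = naive_split(sample)

for s in sentences:
    print(repr(s))
```

The rule handles the decimal point, because no whitespace follows it, but it wrongly breaks after "Dr.", so the three sentences come out as four pieces. Adding an exception for "Dr." fixes this one case and leaves every other abbreviation, genre, and language to further exceptions.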

As an alternative, I present an efficient, trainable algorithm that can be easily adapted to new text genres and to a range of natural languages. The algorithm combines a lexicon of part-of-speech probabilities with a feed-forward neural network that can be trained rapidly. The method requires minimal storage overhead and only a very small amount of training data. It overcomes the limitations of existing methods and achieves very high accuracy.
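The general idea of combining part-of-speech probabilities with a feed-forward network can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described here: the tag set, the lexicon entries and their probabilities, the three-token context window, the hidden-layer size, and all names (`descriptor`, `context_vector`, `TinyNet`) are hypothetical.

```python
import math
import random

# Hypothetical tag set and lexicon; the probabilities are made up for illustration.
TAGS = ["noun", "verb", "abbreviation", "number", "other"]
LEXICON = {
    "dr":      [0.05, 0.0, 0.95, 0.0, 0.0],
    "mr":      [0.05, 0.0, 0.95, 0.0, 0.0],
    "smith":   [1.0, 0.0, 0.0, 0.0, 0.0],
    "arrived": [0.0, 1.0, 0.0, 0.0, 0.0],
    "left":    [0.1, 0.6, 0.0, 0.0, 0.3],
    "he":      [0.0, 0.0, 0.0, 0.0, 1.0],
    "she":     [0.0, 0.0, 0.0, 0.0, 1.0],
}
UNKNOWN = [1.0 / len(TAGS)] * len(TAGS)   # uniform guess for unseen words
PAD = [0.0] * len(TAGS)                   # positions outside the text

def descriptor(token):
    """Return the part-of-speech probability vector for one token."""
    word = token.lower().rstrip(".")
    if word.replace(".", "").isdigit():
        return [0.0, 0.0, 0.0, 1.0, 0.0]
    return LEXICON.get(word, UNKNOWN)

def context_vector(tokens, i):
    """Concatenate descriptors of the tokens at positions i-1, i and i+1."""
    window = []
    for j in (i - 1, i, i + 1):
        window += descriptor(tokens[j]) if 0 <= j < len(tokens) else PAD
    return window

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """One hidden layer, one sigmoid output: P(period at token i ends a sentence)."""

    def __init__(self, n_in, n_hidden=4, seed=0):
        rng = random.Random(seed)
        # The last weight in each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = x + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1] + [1.0]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return xb, hb, y

    def predict(self, x):
        return self.forward(x)[2]

    def train_step(self, x, target, lr=0.5):
        """One stochastic gradient step on the cross-entropy loss."""
        xb, hb, y = self.forward(x)
        delta_out = y - target
        # Hidden deltas are computed before the output weights change.
        delta_h = [delta_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                   for j in range(len(self.w1))]
        for j in range(len(self.w2)):
            self.w2[j] -= lr * delta_out * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] -= lr * delta_h[j] * xb[i]

# Toy training set: (tokens, index of the token ending in a period, label),
# where label 1 means the period marks a sentence boundary.
EXAMPLES = [
    (["Dr.", "Smith", "arrived."], 0, 0),
    (["He", "left.", "She", "arrived."], 1, 1),
    (["Mr.", "Smith", "left."], 0, 0),
    (["She", "arrived.", "He", "left."], 1, 1),
]

net = TinyNet(n_in=3 * len(TAGS))
for _ in range(300):
    for tokens, i, label in EXAMPLES:
        net.train_step(context_vector(tokens, i), label)

p_abbrev = net.predict(context_vector(["Dr.", "Smith", "left."], 0))
p_boundary = net.predict(context_vector(["He", "arrived.", "She", "left."], 1))
print(f"P(boundary | 'Dr.')      = {p_abbrev:.3f}")
print(f"P(boundary | 'arrived.') = {p_boundary:.3f}")
```

Because the network sees only part-of-speech probabilities rather than the words themselves, a sketch of this kind needs few parameters and little training data, and moving to another genre or language amounts to swapping the lexicon and retraining.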

The results presented demonstrate the successful implementation of the algorithm on a 27,294-sentence English corpus. Training took less than one minute on a workstation, and the method correctly labeled over 98.5% of the sentence boundaries. The method also succeeded in labeling texts containing no capital letters. The system has also been adapted to German and French; training times were similarly low, and the resulting accuracy exceeded 99%.



