As libraries increasingly digitize their collections, there are growing numbers of scanned manuscripts that current OCR and handwriting recognition techniques cannot transcribe, because the systems are not trained for the scripts in which these manuscripts are written. Documents in this category range from illuminated medieval manuscripts to handwritten letters to early printed works. Without transcriptions, these documents remain unsearchable. Unfortunately, with existing methods, a user must manually label large amounts of text in the target font to adapt the system to a new script. Some systems require that a user manually segment and label instances of each glyph. Others allow less costly training, in which a user segments and labels entire lines of text instead of individual characters. Still, the collections we consider are extremely diverse, to the extent that in some cases almost every document may be in a different style. Because of this, the cost of manually transcribing dozens of lines of text for each font is prohibitively high.

In this dissertation, we introduce methods that significantly reduce the manual labor involved in adapting a character recognizer to new scripts. Rather than forcing a user to transcribe portions of each target document, our system leverages general language statistics to identify regions of the document from which it may automatically extract new training exemplars. Unlike document-specific transcriptions, these language statistics may be generated in a largely unsupervised manner, allowing our system to automate the process of building a model of each script. We demonstrate the effectiveness of the resulting model by using it to build a search engine for a medieval illuminated manuscript.



