As libraries increasingly digitize their collections, a growing number of scanned manuscripts cannot be transcribed by current OCR and handwriting recognition techniques, because these systems have not been trained on the scripts in which the manuscripts are written. Documents in this category range from illuminated medieval manuscripts to handwritten letters to early printed works. Without transcriptions, these documents remain unsearchable. Unfortunately, with existing methods a user must manually label large amounts of text in the target font to adapt the system to a new script. Some systems require that a user manually segment and label instances of each glyph. Others allow less costly training, in which a user segments and labels entire lines of text instead of individual characters. Still, the collections we consider are extremely diverse, to the extent that in some cases almost every document is in a different style. Because of this, the cost of manually transcribing dozens of lines of text for each font is prohibitively high.
In this dissertation, we introduce methods that significantly reduce the manual labor involved in adapting a character recognizer to new scripts. Rather than forcing a user to transcribe portions of each target document, our system leverages general language statistics to identify regions of the document from which it can automatically extract new training exemplars. Unlike document-specific transcriptions, these language statistics can be generated in a largely unsupervised manner, allowing our system to automate the process of building a model of scripts. We demonstrate the effectiveness of the resulting model by using it to build a search engine for a medieval illuminated manuscript.
Easily Adaptable Handwriting Recognition in Historical Manuscripts