Retrieval of handwritten historical document images

Toni Maximilian Rath, University of Massachusetts Amherst

Abstract

Historical library collections across the world hold huge numbers of handwritten documents. By digitizing these manuscripts, their content can be preserved and made available to a large community via the Internet or other electronic media. Such corpora can nowadays be shared relatively easily, but they are often large, unstructured, and only available in image formats, which makes them difficult to access. In particular, finding specific locations of interest in a handwritten image collection is generally very tedious without some sort of index or other access tool. The current solution for this problem is to manually annotate a historical collection, which is very costly in terms of time and money. In this work we explore several automatic techniques that allow the retrieval of handwritten document images with text queries. These are (i) word spotting, an approach that clusters word images to identify and annotate content-bearing words in a collection, (ii) handwriting recognition followed by text retrieval, and (iii) cross-modal retrieval models, which capture the joint occurrence of annotations and word image features in a probabilistic model. We compare the performance of these approaches empirically on several test collections. The main contributions of this work are a detailed examination of retrieval approaches for historical manuscripts, and the development of the first image retrieval system for historical manuscripts that allows text queries. This system extends the field of digital libraries beyond machine-printed text into historical handwritten documents. Building such a system involves challenges on numerous levels: the noisy historical manuscript domain requires adequate image filtering, normalization and representation techniques, as well as a robust and scalable retrieval framework. We describe the construction of a prototype system, which demonstrates the feasibility of the proposed techniques for a large collection of handwritten historical documents.

Subject Area

Computer science

Recommended Citation

Rath, Toni Maximilian, "Retrieval of handwritten historical document images" (2005). Doctoral Dissertations Available from Proquest. AAI3193936.
https://scholarworks.umass.edu/dissertations/AAI3193936
