Off-campus UMass Amherst users: To download campus access dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this dissertation through interlibrary loan.

Dissertations that have an embargo placed on them will not be available to anyone until the embargo expires.

Document Type

Open Access Dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2016

Month Degree Awarded

February

First Advisor

David A. Smith

Second Advisor

James Allan

Third Advisor

W. Bruce Croft

Fourth Advisor

Bruce Desmarais

Subject Categories

Artificial Intelligence and Robotics | Numerical Analysis and Scientific Computing | Theory and Algorithms

Abstract

Latent variable models of text, such as topic models, have been explored in many areas of natural language processing, information retrieval and machine translation to aid tasks such as exploratory data analysis, automated topic clustering and finding similar documents in mono- and multilingual collections. Many additional applications of these models, however, could be enabled by more efficient techniques for processing large datasets.

In this thesis, we introduce novel methods that offer efficient inference, search and evaluation for latent variable models of text. We present efficient, online inference for representing documents in several languages in a common topic space and fast approximations for finding near neighbors in the probability simplex representation of mono- and multilingual document collections. Empirical evaluations show that these methods are as accurate as —- and significantly faster than —- Gibbs sampling and brute-force all pairs search respectively. In addition, we present a new extrinsic evaluation metric that achieves very high correlation with common performance metrics while being more efficient to compute. We showcase the efficacy and efficiency of our new approaches on the problems of modeling and finding similar documents in a retrieval system for scientific papers, detecting document translation pairs, and extracting parallel sentences from large comparable corpora. This last task, in turn, allows us to efficiently train a translation model from comparable corpora that outperforms a model trained on parallel data.

Lastly, we improve the latent variable model representation of large documents in mono- and multilingual collections by introducing online inference for topic models with hierarchical Dirichlet prior structure over textual regions such as document sections. Modeling variations across textual regions using online inference offers a more effective and efficient document representation, beyond a bag of words, which is usually a handicap for the performance of these models on large documents.

Share

COinS