Off-campus UMass Amherst users: To download campus access dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this dissertation through interlibrary loan.

Dissertations that have an embargo placed on them will not be available to anyone until the embargo expires.

Author ORCID Identifier



Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


First Advisor

Andrew McCallum

Second Advisor

Benjamin Marlin

Third Advisor

Daniel Sheldon

Fourth Advisor

Rajesh Bhatt

Subject Categories

Data Storage Systems


In data integration we transform information from a source into a target schema. A general problem in this task is loss of fidelity and coverage: the source expresses more knowledge than that can be fit into the target schema, or knowledge that is hard to fit into any schema at all. This problem is taken to an extreme in information extraction (IE) where the source is natural language---one of the most expressive forms of knowledge representation. To address this issue, one can either automatically learn a latent schema emergent in text (a brittle and ill-defined task), or manually define schemas. We propose instead to store data in a probabilistic representation of universal schema. This schema is simply the union of all source schemas, and we learn how to predict the cells of each source relation in this union. For example, we could store Freebase relations and relations that are expressed by natural language surface patterns. To populate such a database of universal schema, we present matrix factorization models that learn latent embedding vectors for entity tuples and relations. We show that such latent models achieve substantially higher accuracy than a traditional classification approach on New York Times and Freebase data. Besides binary relations, we use universal schema for unary relations, i.e., entity types. We explore various facets of universal schema matrix factorization models on a large-scale web corpus, including implicature among the relations. We evaluate our approach on the task of question answering using features obtained from universal schema, achieving state-of-the-art accuracy on a benchmark dataset.