Abstract
In data integration we transform information from a source into a target schema. A general problem in this task is loss of fidelity and coverage: the source expresses more knowledge than can fit into the target schema, or knowledge that is hard to fit into any schema at all. This problem is taken to an extreme in information extraction (IE), where the source is natural language---one of the most expressive forms of knowledge representation. To address this issue, one can either automatically learn a latent schema emergent in text (a brittle and ill-defined task) or manually define schemas. We propose instead to store data in a probabilistic representation of universal schema. This schema is simply the union of all source schemas, and we learn to predict the cells of each source relation in this union. For example, we could store Freebase relations alongside relations expressed by natural language surface patterns. To populate such a database of universal schema, we present matrix factorization models that learn latent embedding vectors for entity tuples and relations. We show that such latent models achieve substantially higher accuracy than a traditional classification approach on New York Times and Freebase data. Besides binary relations, we use universal schema for unary relations, i.e., entity types. We explore various facets of universal schema matrix factorization models on a large-scale web corpus, including implicature among the relations. Finally, we evaluate our approach on the task of question answering using features obtained from universal schema, achieving state-of-the-art accuracy on a benchmark dataset.
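As a rough illustration of the matrix factorization the abstract describes, the sketch below scores a (relation, tuple) cell as the dot product of a learned relation embedding and a learned tuple embedding, trained with a logistic loss and random negative sampling. The toy relations, tuple names, hyperparameters, and the choice of loss are illustrative assumptions here, not the exact ranking objective used in the dissertation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy universal schema: the union of KB relations and surface patterns.
    relations = ["freebase:/people/person/place_of_birth",
                 "pattern:X-was-born-in-Y"]
    tuples = [("Obama", "Honolulu"), ("Einstein", "Ulm")]

    # Observed cells: (relation index, tuple index) pairs known to hold.
    # The Freebase cell for ("Einstein", "Ulm") is deliberately unobserved.
    observed = [(0, 0), (1, 0), (1, 1)]

    dim = 8
    rel_emb = rng.normal(scale=0.1, size=(len(relations), dim))
    tup_emb = rng.normal(scale=0.1, size=(len(tuples), dim))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    lr = 0.1
    for epoch in range(200):
        for r, t in observed:
            # One random negative tuple per positive cell (assumption:
            # plain negative sampling rather than a ranking loss).
            t_neg = rng.integers(len(tuples))
            for tt, label in ((t, 1.0), (t_neg, 0.0)):
                score = rel_emb[r] @ tup_emb[tt]
                grad = sigmoid(score) - label  # d(logistic loss)/d(score)
                g_rel = grad * tup_emb[tt]
                g_tup = grad * rel_emb[r]
                rel_emb[r] -= lr * g_rel
                tup_emb[tt] -= lr * g_tup

    # Predict the unobserved Freebase cell for ("Einstein", "Ulm"):
    # because the surface pattern and the KB relation share tuple
    # embeddings, the pattern's evidence can imply the KB relation.
    print(sigmoid(rel_emb[0] @ tup_emb[1]))

The point of the shared embedding space is the implicature the abstract mentions: a cell of one source relation (here a Freebase relation) can be predicted from observations of another (here a textual surface pattern), without ever aligning the two schemas by hand.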
Type
dissertation
Date
2015