Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


Month Degree Awarded


First Advisor

Andrew McCallum

Subject Categories

Artificial Intelligence and Robotics


Self-supervised large language models (LMs) have become a highly influential, foundational tool for many NLP models. For this reason, their expressivity is an important topic of study. In near-universal practice, given the language context, the model predicts a word from the vocabulary using a single embedded vector representation for the context and for each dictionary entry. However, the context sometimes implies that the distribution over predicted words should be multi-modal in the embedding space, and the context's single-vector representation provably fails to capture such a distribution. To address this limitation, we propose to represent the context with multiple vector embeddings, which we term facets. This is distinct from previous work on multi-sense vocabulary embeddings, which employs multiple vectors for the dictionary entries, not the context.
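
The single-vector limitation described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the four 2-D word embeddings, the names `EMB`, `single_logits`, and `multi_facet_logits`, and in particular the choice to combine facets by taking the maximum logit per word. The embeddings form a parallelogram (a + b = c + d), so for any single context vector h the logits satisfy h·a + h·b = h·c + h·d, and a and b can never both outrank c and d.

```python
import math
import random

# Hypothetical 2-D output word embeddings forming a parallelogram: a + b == c + d.
EMB = {"a": (1.0, 1.0), "b": (-1.0, -1.0), "c": (1.0, -1.0), "d": (-1.0, 1.0)}


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def softmax(logits):
    # Numerically stable softmax over a dict of word -> logit.
    m = max(logits.values())
    exps = {w: math.exp(z - m) for w, z in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def single_logits(h):
    # Standard output layer: one context vector, dot product with each word.
    return {w: dot(h, e) for w, e in EMB.items()}


def multi_facet_logits(facets):
    # Assumption for illustration only: each word takes its best logit
    # across the facets. The dissertation's actual layers may differ.
    return {w: max(dot(h, e) for h in facets) for w, e in EMB.items()}


def a_and_b_on_top(probs):
    # True when both modes (a and b) strictly outrank both c and d.
    return min(probs["a"], probs["b"]) > max(probs["c"], probs["d"])


# Random search over single context vectors: none can put a and b on top.
random.seed(0)
single_ok = any(
    a_and_b_on_top(
        softmax(single_logits((random.uniform(-5, 5), random.uniform(-5, 5))))
    )
    for _ in range(10000)
)

# Two facets, one pointing at each mode, recover the bimodal ranking.
multi_probs = softmax(multi_facet_logits([EMB["a"], EMB["b"]]))
multi_ok = a_and_b_on_top(multi_probs)

print("single vector can rank a and b on top:", single_ok)
print("two facets can rank a and b on top:", multi_ok)
```

In this sketch the random search never succeeds because the constraint holds for every h, while the two-facet variant assigns a and b equal, highest probability. The max-over-facets rule is only one simple way to combine several context embeddings.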

In this dissertation, we first present the theoretical limitations of the single context embedding in LMs and show how the theoretical analysis suggests alternative softmax layers that encode a context as multiple embeddings. The proposed alternatives achieve better perplexity than the mixture of softmax (MoS), especially given an ambiguous context, without adding significant computational cost to LMs. Our approaches also let GPT-2 learn to properly copy entities from the context, which increases the coherence of the generated text without requiring any labels.

In addition to predicting the next word, we also use multiple CLS embeddings to improve state-of-the-art pretraining methods for BERT on natural language understanding (NLU) benchmarks without introducing significant extra parameters or computations, especially when the training datasets are small. Furthermore, we show that our multi-facet embeddings improve sequential recommendation, scientific paper embeddings, measurement of sentence similarity, distantly supervised relation extraction, unsupervised text pattern entailment detection, and cold-start citation recommendation. Finally, we use the multiple vector embeddings to predict the future topics of a context and, building on this, propose a novel interactive language generation framework.


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.