Author ORCID Identifier

https://orcid.org/0000-0003-4607-936X

Access Type

Open Access Dissertation

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2022

Month Degree Awarded

September

First Advisor

Andrew McCallum

Subject Categories

Artificial Intelligence and Robotics

Abstract

Self-supervised large language models (LMs) have become a highly influential and foundational tool for many NLP models. For this reason, their expressivity is an important topic of study. In near-universal practice, given the language context, the model predicts a word from the vocabulary using a single embedded vector representation of both the context and the dictionary entries. Notably, the context sometimes implies that the distribution over predicted words should be multi-modal in the embedding space, yet the context’s single-vector representation provably fails to capture such a distribution. To address this limitation, we propose representing the context with multiple vector embeddings, which we term facets. This is distinct from previous work on multi-sense vocabulary embeddings, which employs multiple vectors for the dictionary entries rather than the context. In this dissertation, we first present the theoretical limitations of the single context embedding in LMs and show how these analyses suggest new alternative softmax layers that encode a context as multiple embeddings. The proposed alternatives achieve better perplexity than the mixture of softmaxes (MoS), especially given an ambiguous context, without adding significant computational cost to LMs. Our approaches also allow GPT-2 to learn to properly copy entities from the context, which increases the coherence of the generated text without requiring any labels. In addition to predicting the next word, we use multiple CLS embeddings to improve state-of-the-art pretraining methods for BERT on natural language understanding (NLU) benchmarks without introducing significant extra parameters or computation, especially when the training datasets are small. Furthermore, we show that our multi-facet embeddings improve sequential recommendation, scientific paper embeddings, sentence similarity measurement, distantly supervised relation extraction, unsupervised text pattern entailment detection, and cold-start citation recommendation. Finally, we use the multiple vector embeddings to predict the future topics of a context and, building on this basis, we propose a novel interactive language generation framework.
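
To illustrate the core idea of the abstract, the following is a minimal PyTorch sketch of a softmax output layer that projects a single context hidden state into several "facet" embeddings and mixes the softmax distributions they induce over the vocabulary, so the overall prediction can be multi-modal in the embedding space. This is an illustrative assumption of how such a layer could look, not the dissertation's exact architecture; the class name, facet count, and mixing scheme are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFacetSoftmax(nn.Module):
    """Sketch of a multi-facet context softmax (illustrative, not the
    dissertation's exact design): one context vector is expanded into
    several facet embeddings, and their softmax distributions are mixed."""

    def __init__(self, hidden_size: int, vocab_size: int, num_facets: int = 3):
        super().__init__()
        self.num_facets = num_facets
        # Project the single context vector into several facet embeddings.
        self.facet_proj = nn.Linear(hidden_size, num_facets * hidden_size)
        # Mixture weights over facets, predicted from the context.
        self.prior = nn.Linear(hidden_size, num_facets)
        # Output word embeddings: a single vector per dictionary entry.
        self.word_emb = nn.Parameter(torch.randn(vocab_size, hidden_size) * 0.02)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) -- the single context representation.
        batch, hidden_size = hidden.shape
        facets = self.facet_proj(hidden).view(batch, self.num_facets, hidden_size)
        # Each facet yields its own distribution over the vocabulary ...
        logits = facets @ self.word_emb.T                   # (batch, facets, vocab)
        probs = F.softmax(logits, dim=-1)
        # ... and the facet distributions are mixed, so the combined
        # prediction can place mass on several separated word clusters.
        weights = F.softmax(self.prior(hidden), dim=-1)     # (batch, facets)
        mixed = torch.einsum("bf,bfv->bv", weights, probs)
        return torch.log(mixed + 1e-9)                      # log-probs for an NLL loss

For example, MultiFacetSoftmax(hidden_size=768, vocab_size=50257, num_facets=3) applied to a batch of GPT-2-sized hidden states of shape (4, 768) returns log-probabilities of shape (4, 50257). Because the mixture is taken over probabilities rather than over a single averaged context vector, this style of layer can represent the multi-modal next-word distributions that a single context embedding provably cannot.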

DOI

https://doi.org/10.7275/31039789

Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.
