Abstract
At the core of recent advancements in natural language processing are language models, which are trained to predict the next token given the preceding context. Recent developments in deep learning have led to the efficient scaling of the context window in Transformer-based language models. Despite this progress, these models still exhibit severe limitations on long-context tasks, such as book-level summarization and long-document question answering. While the context window size has been continuously increasing, there is a lack of understanding of how these models utilize long-range context, i.e., context that spans at least several thousand tokens. As such, we first provide an analysis of long-range context modeling with both perplexity and segment-level task evaluations. Our results show that perplexity, the most commonly used intrinsic metric for language model evaluation, may obscure the evaluation of long-range context modeling. In contrast, segment-level evaluation, which involves computing the probability of a sequence of tokens rather than a single token as done in perplexity, proves to be a more suitable method for evaluating long-range context modeling. Based on this finding, we enhance segment-level evaluation by proposing a challenge dataset, ChapterBreak, and demonstrate that SuffixLM, a model trained with segment-level signals, outperforms the standard token-level language model on this task. The limited context modeling capability prompts us to investigate new ways to improve recent large language models. To this end, we first develop a prompting framework, PEARL, which leverages large instruction-fine-tuned language models to decompose complex reasoning into executable plans. We demonstrate the efficacy of PEARL on a subset of a long-document QA dataset in which the correct answer depends on long-range context rather than a short excerpt. Our second approach builds on the benefits of modeling context at the segment level.
Concretely, we propose a new training method, SuffixRL, which fine-tunes a token-level language model directly using segment-level signals. We show that training models with SuffixRL leads to more natural and coherent continuations in an open-ended generation setting. Finally, we conclude this thesis by identifying seven concrete topics that hold promise for future exploration. We hope this thesis can spur more principled research in long-context modeling.
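The contrast the abstract draws between perplexity and segment-level evaluation can be made concrete with a minimal sketch. The per-token probabilities below are toy numbers (not drawn from the thesis or any real model): perplexity exponentiates the mean per-token negative log-likelihood, while segment-level evaluation scores a multi-token span as a single unit via its joint log-probability under the chain rule.

```python
import math

# Toy probabilities a hypothetical language model assigns to each token
# of a 4-token segment given its preceding context (illustrative only).
token_probs = [0.25, 0.10, 0.40, 0.05]

# Token-level view: perplexity = exp of the mean negative log-likelihood
# over individual next-token predictions.
nll = [-math.log(p) for p in token_probs]
perplexity = math.exp(sum(nll) / len(nll))

# Segment-level view: score the whole segment as one unit by its joint
# log-probability, sum of per-token log-probs under the chain rule.
segment_logprob = sum(math.log(p) for p in token_probs)

print(f"perplexity:       {perplexity:.3f}")
print(f"segment log-prob: {segment_logprob:.3f}")
```

Note that perplexity is a length-normalized transform of the same quantity (`exp(-segment_logprob / n)`); averaging over every token in a corpus is what lets strong local predictions dominate the metric and mask weak use of distant context, which is the failure mode segment-level evaluation is designed to expose.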
Type
Dissertation
Date
2024-05
License
Attribution 4.0 International
http://creativecommons.org/licenses/by/4.0/