Author ORCID Identifier
https://orcid.org/0000-0002-8663-2628
Access Type
Open Access Dissertation
Document Type
dissertation
Degree Name
Doctor of Philosophy (PhD)
Degree Program
Computer Science
Year Degree Awarded
2021
Month Degree Awarded
September
First Advisor
Brendan T. O'Connor
Subject Categories
Computational Engineering
Abstract
People have been analyzing documents by reading keywords in context for centuries. Traditional approaches like paper concordances or digital keyword-in-context viewers display all occurrences of a single word from a corpus vocabulary amid immediately surrounding tokens or characters, to show readers how individual lexical items are used in bodies of text. We propose that these common tools are one particular application of a more general approach to analyzing documents, which we define as lexical corpus analysis. We then propose new natural language processing techniques for lexically-focused corpus investigation, and demonstrate how such methods can be used to create new user-facing tools for analyzing corpora.
Our contributions are divided into three parts. In Part 1, we consider how to represent a corpus lexicon to best reflect human mental and linguistic models of a domain, and propose a natural language processing (NLP) method for enriching a unigram corpus vocabulary with multiword phrases. In Part 2, we consider how lexical systems might show query terms in context to best satisfy user search need, and offer several new techniques focused on summarizing mentions of a query term in context. Finally, in Part 3, we apply our proposed NLP methods towards new user-facing systems for lexical corpus analysis, and present user studies with journalists and historians which investigate how new lexical tools can help such users in their work.
DOI
https://doi.org/10.7275/24608032
Recommended Citation
Handler, Abram Kaufman, "Natural Language Processing for Lexical Corpus Analysis" (2021). Doctoral Dissertations. 2332.
https://doi.org/10.7275/24608032
https://scholarworks.umass.edu/dissertations_2/2332
Creative Commons License
This work is licensed under a Creative Commons Attribution-No Derivative Works 4.0 License.