Doctoral Dissertations

Poetry: Identification, Entity Recognition, and Retrieval

John J. Foley IV, University of Massachusetts Amherst

Author ORCID Identifier

https://orcid.org/0000-0002-5058-293X

AccessType

Open Access Dissertation

Document Type

dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2019

Month Degree Awarded

May

First Advisor

James Allan

Second Advisor

W. Bruce Croft

Third Advisor

Brendan O'Connor

Fourth Advisor

Joe Pater

Subject Categories

Computer Sciences

Abstract

Modern advances in natural language processing (NLP) and information retrieval (IR) make it possible to automatically analyze, categorize, process, and search textual resources. However, generalizing these approaches remains an open problem: models that appear to understand certain types of data must be re-trained on other domains. Models often make assumptions about the length, structure, discourse model, and vocabulary of a particular corpus. Trained models can become biased toward their original dataset, learning, for example, that all capitalized words are names of people or that short documents are more relevant than longer ones. As a result, small amounts of noise or shifts in style can cause models to fail on unseen data. The key to more robust models is to study text analytics tasks on more challenging and diverse data. Poetry is an ancient art form that is believed to pre-date writing and is still a key form of expression through text today. Some poetry forms (e.g., haiku and sonnets) have rigid structure but still break our traditional expectations of text. Other poetry forms drop punctuation and other rules in favor of expression. Our contributions include a set of novel, challenging datasets that extend traditional tasks: a text classification task for which content features perform poorly, a named entity recognition task that is inherently ambiguous, and a retrieval corpus over the largest public collection of poetry ever released. We begin with poetry identification, the task of finding poetry within existing textual collections. Because content models do not generalize well, we devise an effective method of extracting poetry based on how it is usually formatted within digitally scanned books. Then we turn to the content of poetry: we construct a dataset of around 6,000 tagged spans that identify the people, places, organizations, and personified concepts within poetry. We show that cross-training with existing datasets based on news corpora helps modern models learn to recognize entities within poetry. Finally, we return to IR and construct a dataset of queries and documents inspired by real-world data that expose some of the key challenges of searching through poetry. Our work is the first significant effort to use poetry in these three tasks, and our datasets and models will provide strong baselines for new avenues of research on this challenging domain.

DOI

https://doi.org/10.7275/14103760

Recommended Citation

Foley, John J. IV, "Poetry: Identification, Entity Recognition, and Retrieval" (2019). Doctoral Dissertations. 1573.
https://doi.org/10.7275/14103760 https://scholarworks.umass.edu/dissertations_2/1573

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Included in

Computer Sciences Commons
