Off-campus UMass Amherst users: To download campus access dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this dissertation through interlibrary loan.

Dissertations that have an embargo placed on them will not be available to anyone until the embargo expires.

Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


Month Degree Awarded


First Advisor

Andrew McCallum

Subject Categories

Artificial Intelligence and Robotics | Computer Sciences


Voice assistants such as Amazon Alexa, Apple Siri, and Google Assistant have become ubiquitous. They rely on spoken language understanding, which typically consists of an Automatic Speech Recognition (ASR) component and a Natural Language Understanding (NLU) component. ASR takes user speech as input and generates a text transcription. NLU takes the text transcription as input and generates a semantic parse to identify the requested actions, called intents (play music, turn on lights, etc.) and any relevant entities, called slots (which song to play? which lights to turn on?).

These components require massive amounts of training data to achieve good performance. In this dissertation, I identify and explore various data-related challenges to improve language understanding in voice assistants, specifically, the NLU component and the pipelined ASR-NLU architecture.

I first present a state-of-the-art NLU system based on sequence-to-sequence neural models that simplifies the traditional semantic parsing architecture, while also allowing it to handle complex user utterances consisting of multiple nested intents and slots. This work serves as an anchor for future data-constraint work. Next, I present an architecture to completely replace the pipelined ASR-NLU system with a fully end-to-end system. Our system is jointly trained on multiple speech-to-text and text-to-text tasks, allowing for transfer learning and also creating a shared representation for both speech and text. It outperforms previous pipelined and end-to-end systems, and performs end-to-end semantic parsing on a new domain by only training on a few text-to-text annotated NLU examples. Next, I demonstrate how to train large sequence-to-sequence NLU systems using a handful of examples by using auxiliary tasks to pre-train various components of the system. Finally, I demonstrate methods to perform low-resource domain adaptation. In low-resource domain adaptation, the goal is to parse utterances from a new domain using some simple metadata about the new domain and a small number of annotated training examples (few-shot) or no training examples (zero-shot) from that domain.


Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.

Available for download on Friday, September 01, 2023