Off-campus UMass Amherst users: To download campus access dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this dissertation through interlibrary loan.

Dissertations that have an embargo placed on them will not be available to anyone until the embargo expires.

Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


Month Degree Awarded


First Advisor

Hong Yu

Subject Categories

Artificial Intelligence and Robotics


A supervised machine learning model is trained with a large set of labeled training data, and evaluated on a smaller but still large set of test data. Especially with deep neural networks (DNNs), the complexity of the model requires that an extremely large data set is collected to prevent overfitting. It is often the case that these models do not take into account specific attributes of the training set examples, but instead treat each equally in the process of model training. This is due to the fact that it is difficult to model latent traits of individual examples at the scale of hundreds of thousands or millions of data points. However, there exist a set of psychometric methods that can model attributes of specific examples and can greatly improve model training and evaluation in the supervised learning process. Item Response Theory (IRT) is a well-studied psychometric methodology for scale construction and evaluation. IRT jointly models human ability and example characteristics such as difficulty based on human response data. We introduce new evaluation metrics for both humans and machine learning models build using IRT, and propose new methods for applying IRT to machine learning-scale data. We use IRT to make contributions to the machine learning community in the following areas: (i) new test sets for evaluating machine learning models with respect to a human population, (ii) new insights about how deep-learning models learn by tracking example difficulty and training conditions, and (iii) new methods for data selection and curriculum building to improve model training efficiency, (iv) a new test of electronic health literacy built with questions extracted from de-identified patient Electronic Health Records (EHRs). We first introduce two new evaluation sets built and validated using IRT. These tests are the first IRT test sets to be applied to natural language processing tasks. Using IRT test sets allows for more comprehensive comparison of NLP models. Second, by modeling the difficulty of test set examples, we identify patterns that emerge when training deep neural network models that are consistent with human learning patterns. Specifically, as models are trained with larger training sets, they learn easy test set examples more quickly than hard examples. Third, we present a method for using soft labels on a subset of training data to improve deep learning model generalization. We show that fine-tuning a trained deep neural network with as little as 0.1% of the training data can improve model generalization in terms of test set accuracy. Fourth, we propose a new method for estimating IRT example and model parameters that allows for learning parameters at a much larger scale than previously available to accommodate the large data sets required for deep learning. This allows for learning IRT models at machine learning scale, with hundreds of thousands of examples and large ensembles of machine learning models. The response patterns of machine learning models can be used to learn IRT example characteristics instead of human response patterns. Fifth, we introduce a dynamic curriculum learning process that estimates model competency during training to adaptively select training data that is appropriate for learning at the given epoch. Finally, we introduce the ComprehENotes test, the first test of EHR comprehension for humans. The test is an accurate measure for identifying individuals with low EHR note comprehension ability, and validates the effectiveness of previously self-reported patient comprehension evaluations.


Creative Commons License

Creative Commons Attribution 4.0 License
This work is licensed under a Creative Commons Attribution 4.0 License.