Author ORCID Identifier


Campus-Only Access for Five (5) Years

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program


Year Degree Awarded


Month Degree Awarded


First Advisor

Anna Liu

Second Advisor

Markos A. Katsoulakis

Subject Categories

Applied Statistics | Artificial Intelligence and Robotics


Active learning is a machine learning technique in which a learning algorithm interactively queries an information source to obtain the desired outputs at new data points. It is closely related to optimal experimental design in the statistical literature, although the two have different goals. This dissertation tackles several challenging problems in active learning: class imbalance, noise features, abundant unlabeled data, initial sampling from large databases, and query strategy optimization. Three topics are covered.

The first topic concerns active learning on partially labeled, imbalanced data with noise features. The main objectives are feature selection on small data sets with class imbalance, overcoming the cold start problem on extremely imbalanced data, and classification on extremely imbalanced data with noise features. We propose an approach that adapts traditional oversampling techniques to the active learning framework and combines them with feature selection techniques and a novel minority-class biased sampling strategy. When dealing with imbalanced data, traditional data re-sampling approaches usually require the labels of the observations from the minority class and are therefore not directly applicable in the active learning framework, where labeled instances are few and unlabeled data are abundant. Moreover, traditional data re-sampling approaches often assume that the class distribution is consistent with the data distribution, which is not always true when noise features are present in the data. The proposed active oversampling approach, by contrast, does not require the underlying label distribution of the data to be known in advance. Equipped with an informative feature selection strategy, it is able to remove noise features effectively before sampling. Its active learning querying strategy balances the trade-off between data imbalance and data informativeness according to the minority rate of the training set, which makes the approach especially useful for alleviating the cold start problem.
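A query rule that trades off imbalance against informativeness according to the minority rate of the training set could look roughly like the sketch below. It is an illustration only, not the dissertation's actual strategy: the function names, the linear weighting, the `target_rate` parameter, and the use of binary entropy as the informativeness measure are all assumptions made for this example.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def minority_weight(labels, minority_label=1, target_rate=0.5):
    """Hypothetical weight in [0, 1] placed on minority-seeking.

    It is 1 when the labeled set contains no minority examples and
    falls linearly to 0 once the minority rate reaches target_rate.
    """
    if not labels:
        return 1.0
    rate = sum(1 for y in labels if y == minority_label) / len(labels)
    return max(0.0, 1.0 - rate / target_rate)

def select_query(p_minority, labels, minority_label=1, target_rate=0.5):
    """Return the index of the unlabeled point to query next.

    p_minority[i] is the current model's predicted probability that
    unlabeled point i belongs to the minority class.
    """
    w = minority_weight(labels, minority_label, target_rate)
    # Blend "likely minority" with "uncertain" (assumed linear blend).
    scores = [w * p + (1.0 - w) * binary_entropy(p) for p in p_minority]
    return max(range(len(scores)), key=scores.__getitem__)
```

With no minority labels yet (the cold start case), `select_query([0.5, 0.9, 0.1], [0, 0, 0, 0])` favors the point most likely to be a minority instance. Once the labeled set is balanced, the same call with labels `[0, 1, 0, 1]` falls back to pure uncertainty sampling.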

The second topic concerns sampling on large-scale data and creating an informative pool for active learning. Traditional active learning, which combines supervised learning algorithms with an interactive querying procedure, does not scale well to large data sets and often runs into a cold start. To keep the most informative learning data while maximally reducing data complexity, we design a novel sampling method for initializing the active learning process and reducing the data space of the unlabeled data. Under smoothness conditions, which are defined later and essentially mean that data labels are consistent with data clusters in a lower-dimensional space, the proposed sampling approach applies a density-based clustering algorithm to divide the original data space into clusters of various sizes and shapes. Then, based on the density of the data instances in each cluster, the central core points with the highest densities are selected and queried as the most representative samples of that cluster. Following the same idea, exemplar points and border points of each cluster are kept as candidates for future interactive data exploitation and exploration. Finally, we refine the candidate unlabeled data according to a measure of model entropy reduction. When the data structure is not consistent with the underlying class distribution, state-of-the-art deep data embedding techniques can be applied to strengthen the smoothness assumptions. The proposed method is able to efficiently retrieve the most representative labeled instances from each class. Moreover, by discarding both outliers and less informative data points, the proposed sampling approach filters out less influential data and effectively reduces the overall computational cost of active learning.
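The selection of core and border points can be sketched as follows. This is a simplified illustration under stated assumptions, not the dissertation's method: cluster assignments are taken as given (for example, from a density-based clustering algorithm that marks outliers with `-1`), density is approximated by a neighbor count within a hypothetical radius `eps`, and the exemplar selection and entropy-based refinement steps are omitted.

```python
import math
from collections import defaultdict

def local_density(points, eps):
    """Count, for each point, the other points within distance eps."""
    return [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= eps)
        for i, p in enumerate(points)
    ]

def core_and_border(points, cluster_ids, eps):
    """Pick one core and one border point index per cluster.

    cluster_ids[i] is the cluster of points[i]; -1 marks an outlier,
    which is dropped from the candidate pool (an assumed convention).
    """
    density = local_density(points, eps)
    members = defaultdict(list)
    for i, c in enumerate(cluster_ids):
        if c != -1:
            members[c].append(i)
    # Highest-density member: queried first as the cluster representative.
    core = {c: max(idx, key=lambda i: density[i]) for c, idx in members.items()}
    # Lowest-density member: kept as a candidate for later exploration.
    border = {c: min(idx, key=lambda i: density[i]) for c, idx in members.items()}
    return core, border
```

For example, with two small clusters on a line and one outlier, `core_and_border(points, ids, eps=1.5)` returns the middle point of each cluster as its core, so only two labels are requested to initialize the learner.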

The third topic concerns combining deep learning and active learning. One issue with deep learning models is that a sufficiently large amount of well-annotated training data must be available in order to train a deep neural network properly. We propose a deep active learning approach that uses active learning to guide deep neural network algorithms in adaptively selecting training data. Active learning is able to start with a small labeled training set and iteratively expand it with the most informative samples. By embedding the deep neural network into the active learning framework, we are able to intelligently select the best training set for the model. As a result, the number of examples needed to learn a concept can often be much lower than the number required in ordinary supervised learning. Unlike existing approaches, the proposed deep active learning framework uses an approximate margin-based annotation query strategy. Through gradient-based adversarial attacks, the proposed strategy queries unlabeled samples based on the distance between each sample and its nearest adversarial example. Because the decision boundary of the model is often intractable, especially for neural networks, a traditional margin-based approach that measures the exact distance between a sample and the decision boundary is not applicable. One common alternative is to approximate the exact distance by the distance between nearest neighbors from different classes. However, this alternative is often criticized for its coarseness and high computational expense. The proposed approximate distance, on the other hand, can effectively reduce the computational complexity while maintaining accuracy. We adapt state-of-the-art gradient descent optimization algorithms to gradient-based adversarial attack methods, including an iterative gradient-based approach with momentum and Nesterov accelerated gradient. Furthermore, we embed these adversarial attack methods into an active learning framework and propose a new margin-based active annotation strategy. To allow samples to be assessed and queried under uncertainty, a new entropy-based loss function is defined for the gradient-based adversarial attack methods.
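The idea of approximating the margin by the distance to the nearest adversarial example can be illustrated with a toy sketch. Everything below is an assumption made for illustration: a logistic regression model stands in for the deep network, the attack is a plain iterative sign-gradient method with momentum, the loss is ordinary cross-entropy against the model's own predicted label rather than the dissertation's entropy-based loss, and the names `step`, `mu`, and `max_iter` are hypothetical.

```python
import math

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic model (stand-in for a network)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def adversarial_distance(w, b, x, step=0.05, mu=0.9, max_iter=200):
    """Approximate margin: L2 distance from x to the first adversarial point found."""
    y = 1 if predict_proba(w, b, x) >= 0.5 else 0  # model's own label
    x_adv = list(x)
    velocity = [0.0] * len(x)
    for _ in range(max_iter):
        p = predict_proba(w, b, x_adv)
        if (p >= 0.5) != (y == 1):  # predicted label has flipped
            return math.dist(x, x_adv)
        # Gradient of cross-entropy loss w.r.t. the input for this model.
        grad = [(p - y) * wi for wi in w]
        norm = sum(abs(g) for g in grad) or 1.0
        # Momentum accumulation, then a fixed-size signed ascent step.
        velocity = [mu * v + g / norm for v, g in zip(velocity, grad)]
        x_adv = [xa + step * sign(v) for xa, v in zip(x_adv, velocity)]
    return math.inf  # no adversarial example found within the budget

def select_query(w, b, pool):
    """Query the unlabeled sample with the smallest approximate margin."""
    dists = [adversarial_distance(w, b, x) for x in pool]
    return min(range(len(pool)), key=dists.__getitem__)
```

With `w = [1.0, 0.0]`, `b = 0.0`, and a pool of `[[2.0, 0.0], [0.2, 0.0], [-1.0, 0.0]]`, the sample at `[0.2, 0.0]` needs the fewest attack steps to flip its label and is therefore the one selected for annotation.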