Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


Month Degree Awarded


First Advisor

Erik Learned-Miller

Subject Categories

Artificial Intelligence and Robotics | Computer Sciences


Human faces not only pose a challenging recognition problem for computer vision but are also an important source of information about identity, intent, and state of mind. These properties make the analysis of faces important not just as an algorithmic challenge but as a gateway to developing computer vision methods that can better follow the intent and goals of human beings. In this thesis, we are interested in face clustering in videos. Given a raw video, with no caption or annotation, we want to group all detected faces by their identity. We address three problems in the area of face clustering and propose approaches to tackle them.

Existing link-based face-clustering systems are sensitive to false connections between two different people. We introduce a new similarity measure that helps the verification system produce very few false connections at moderate recall. Further, we introduce a novel clustering method called Erdős-Rényi clustering, which is based on an observation from random graph theory: large clusters can be fully connected by joining just a small fraction of their node pairs. Our method achieves state-of-the-art results on multiple video datasets and also on standard face databases.

What happens if faces are not sufficiently clear for direct recognition, due to small scale, occlusion, or extreme pose? We observe that, when humans are uncertain about the identity of two faces, they use clothes or other contextual cues, e.g., specific objects or textures, to infer identity. With this observation, we propose the Face-Background Network (FB-Net), which takes as input not only the faces but also the entire scene to enhance the performance of face clustering. In order for the network to learn background features that are informative about identity, we introduce a new dataset that contains face identities in the context of consistent scenes. We show that FB-Net outperforms the state-of-the-art method that uses only face-level features for the task of video face clustering.

The performance of face clustering depends on a good face detector. However, improving the performance of a face detector requires expensive labeling of faces. In this work, we propose an approach to reduce the mistakes of an existing face detector by using many hours of freely available unlabeled videos on the web. Specifically, based on the observation that false positives and false negatives are often isolated in time, we demonstrate a method to mine hard examples automatically using temporal continuity in videos. In particular, we analyze the output of a trained detector on video sequences and mine detections that are isolated in time, which are likely to be hard examples. Our experiments show that re-training detectors on these automatically obtained examples often significantly improves performance. We present experiments on multiple architectures and multiple datasets, including face detection, pedestrian detection, and other object categories.
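The idea of mining detections that are isolated in time can be illustrated with a minimal sketch. This is not the method from the dissertation: it assumes a heavily simplified setting in which a single object location is followed through a video and the detector output is reduced to one boolean per frame. The function name `mine_isolated_frames` and the `window` parameter are hypothetical, introduced only for illustration.

```python
from typing import List, Tuple


def mine_isolated_frames(
    detected: List[bool], window: int = 2
) -> Tuple[List[int], List[int]]:
    """Return (isolated_hits, isolated_misses) as lists of frame indices.

    Assumption for this sketch: detected[t] is True if the detector fired
    at frame t for one fixed object location. Frames closer than `window`
    to either end of the sequence are skipped.
    """
    isolated_hits: List[int] = []    # fired alone: candidate false positives
    isolated_misses: List[int] = []  # missed alone: candidate false negatives
    for t in range(window, len(detected) - window):
        # The `window` frames before and the `window` frames after frame t.
        neighbors = detected[t - window:t] + detected[t + 1:t + 1 + window]
        if detected[t] and not any(neighbors):
            isolated_hits.append(t)
        elif not detected[t] and all(neighbors):
            isolated_misses.append(t)
    return isolated_hits, isolated_misses


if __name__ == "__main__":
    F, T = False, True
    flags = [F, F, F, T, F, F, F, T, T, T, F, T, T, T]
    print(mine_isolated_frames(flags))  # prints ([3], [10])
```

In this toy sequence, frame 3 is a lone detection surrounded by empty frames and frame 10 is a lone gap inside a run of detections. Under the temporal-continuity assumption, both are candidates for hard examples that could be used when re-training a detector.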


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.