Doctoral Dissertations

Incremental Non-Greedy Clustering at Scale

Nicholas Monath, University of Massachusetts Amherst

Author ORCID Identifier

https://orcid.org/0000-0002-5135-2423

Access Type

Open Access Dissertation

Document Type

dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2022

Month Degree Awarded

February

First Advisor

Andrew McCallum

Subject Categories

Artificial Intelligence and Robotics

Abstract

Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution put several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for any user-specified similarity function. Hierarchical clusterings are often desired as they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide for both fine-grained clusters and uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions. These methods often impose requirements on the similarity functions used and/or empirically tend to over-merge clusters, which can lead to inaccurate clusterings. In this thesis, we present incremental and scalable methods for hierarchical clustering that empirically satisfy the above desiderata. Our work aims to represent uncertainty and meaningful alternative clusterings, to efficiently reconsider past decisions in the incremental case, and to use parallelism to scale to massive datasets. Our method, Grinch, handles incrementally arriving data in a non-greedy fashion by reconsidering past decisions using tree-structure re-arrangements (e.g., rotations and grafts) invoked in accordance with the user's specified similarity function. To achieve scalability to massive datasets, our method, SCC, builds a hierarchical clustering in a level-wise, bottom-up manner. Certain clustering decisions are made independently in parallel within each level, and a global similarity threshold schedule prevents greedy over-merging.
We show how SCC can be combined with the tree-structure re-arrangements in Grinch to form a mini-batch algorithm achieving both scalable and incremental performance. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which can better represent uncertainty in clustering by representing overlapping clusters. We introduce an efficient bottom-up method for DAG-structured clustering, Llama. For each of the proposed methods, we provide both a theoretical and an empirical analysis. Empirically, our methods achieve state-of-the-art results on clustering benchmarks in both the batch and the incremental settings, including improvements of multiple points in dendrogram purity and scalability to billions of points.
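
The level-wise, bottom-up idea with a global threshold schedule can be illustrated with a toy sketch. This is not the dissertation's implementation of SCC. The average linkage, the rule of linking each cluster to its most similar neighbour, the function names, and the threshold values are all assumptions chosen for illustration, and the per-cluster decisions are computed in a plain loop rather than in parallel.

```python
def average_similarity(a, b, pair_sim):
    """Average pairwise similarity between two clusters of point indices."""
    return sum(pair_sim(i, j) for i in a for j in b) / (len(a) * len(b))


def merge_round(clusters, pair_sim, threshold):
    """One level (hypothetical rule): each cluster links to its most similar
    other cluster if that similarity clears the threshold; connected
    components of the resulting links are merged."""
    parent = list(range(len(clusters)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each cluster's decision depends only on the current level's clusters,
    # so this loop could in principle run independently per cluster.
    for a in range(len(clusters)):
        best, best_sim = None, threshold
        for b in range(len(clusters)):
            if a == b:
                continue
            s = average_similarity(clusters[a], clusters[b], pair_sim)
            if s >= best_sim:
                best, best_sim = b, s
        if best is not None:
            parent[find(a)] = find(best)

    merged = {}
    for idx, members in enumerate(clusters):
        merged.setdefault(find(idx), []).extend(members)
    return [tuple(sorted(m)) for m in merged.values()]


def levelwise_clustering(points, sim, thresholds):
    """Build a list of flat clusterings, one per level that changed.
    `thresholds` is a decreasing similarity schedule (an assumed input)."""
    def pair_sim(i, j):
        return sim(points[i], points[j])

    clusters = [(i,) for i in range(len(points))]
    levels = [clusters]
    for t in thresholds:
        merged = merge_round(clusters, pair_sim, t)
        if len(merged) < len(clusters):  # keep only levels that merged something
            levels.append(merged)
            clusters = merged
    return levels


# Toy 1-D data; similarity is negative absolute distance.
points = [0, 1, 2, 50, 51, 90]
levels = levelwise_clustering(points, lambda x, y: -abs(x - y),
                              thresholds=[-1.5, -10, -45, -100])
for level in levels:
    print(level)
```

In this toy run the strict first threshold merges only near-identical points into index groups (0, 1, 2) and (3, 4), the second threshold merges nothing, and the looser thresholds then join the remaining groups. The schedule is what keeps distant groups apart in the early levels.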

DOI

https://doi.org/10.7275/26906986

Recommended Citation

Monath, Nicholas, "Incremental Non-Greedy Clustering at Scale" (2022). Doctoral Dissertations. 2427.
https://doi.org/10.7275/26906986 https://scholarworks.umass.edu/dissertations_2/2427

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Included in

Artificial Intelligence and Robotics Commons

ScholarWorks@UMass Amherst
