Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program


Year Degree Awarded


Month Degree Awarded


First Advisor

Patrick Flaherty

Subject Categories

Statistics and Probability


With the development of next-generation sequencing (NGS) technology, researchers can readily obtain data from millions of cells (bulk samples) or from a single cell. Bulk samples can capture broad changes, but they risk providing an average measurement that is not representative of the genetic state of any individual cell. Single-cell experiments capture the genetic state of an individual cell, but each single-cell sample carries greater uncertainty, and sampling enough cells to obtain a representative sample of the population is expensive. Therefore, there is a need to integrate information from both bulk and single-cell data to obtain a comprehensive understanding of subclonal populations within an individual tumor as well as across individuals. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy, and we use this representation to derive a fast Gibbs sampling inference algorithm using the augment-and-marginalize method. Experiments show that our model outperforms state-of-the-art methods.
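One standard fact that makes Gamma-based representations of Dirichlet priors possible is that normalizing independent Gamma draws yields a Dirichlet draw. The sketch below illustrates only that identity in the finite-dimensional case; it is not the dissertation's model or its Gibbs sampler. The function names (`dirichlet_from_gammas`, `mean_weights`) and the concentration values are hypothetical choices made for illustration.

```python
import random


def dirichlet_from_gammas(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing independent
    Gamma(alpha_k, 1) draws (a standard identity, shown for illustration)."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def mean_weights(alphas, n_draws=20000, seed=0):
    """Average many draws; the result should approach alphas / sum(alphas)."""
    rng = random.Random(seed)
    sums = [0.0] * len(alphas)
    for _ in range(n_draws):
        weights = dirichlet_from_gammas(alphas, rng)
        for k, w in enumerate(weights):
            sums[k] += w
    return [s / n_draws for s in sums]


# Hypothetical concentration parameters for three subpopulations.
alphas = [2.0, 1.0, 1.0]
single_draw = dirichlet_from_gammas(alphas, random.Random(1))
averaged = mean_weights(alphas)
print(single_draw)
print(averaged)
```

Each draw is a vector of nonnegative weights that sum to one, which is how mixture proportions over subpopulations are typically represented. The averaged weights should lie close to the normalized concentration parameters.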

Another goal of genomic data analysis is to understand which genes are essential and under what environmental conditions they are essential. Transposon sequencing provides a powerful tool for finding conditionally essential genes. However, methods are needed that go beyond a one-at-a-time analysis of conditionally essential genes and learn higher-order representations that identify conditionally essential networks of genes. While existing methods do identify essential genes from transposon sequencing data, they do not provide a representation of the space of essential genes. For example, if two genes share the same pattern of essentiality across all conditions, there is a higher-level representation that couples those genes into a network. The goal of this work is to build such a higher-level representation of the set of essential genes and to identify genes that share essentiality patterns across conditions. To address this need, we develop a novel, computationally efficient hierarchical nonparametric Bayesian model: the hierarchical Gamma-Poisson process (hGP).
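The idea of coupling genes that share an essentiality pattern can be sketched with a deliberately simple, deterministic grouping. This is not the hGP model, which is probabilistic and works from transposon sequencing data; the sketch assumes binary essentiality calls are already available. The gene names, the calls, and the function `group_by_pattern` are hypothetical.

```python
from collections import defaultdict


def group_by_pattern(essentiality):
    """Group genes whose essentiality calls match across all conditions.

    `essentiality` maps a gene name to a list of 0/1 calls, one per
    condition (1 = essential). Exact matching is an illustrative
    simplification, not the probabilistic model described above.
    """
    groups = defaultdict(list)
    for gene, pattern in essentiality.items():
        groups[tuple(pattern)].append(gene)
    return {pattern: sorted(genes) for pattern, genes in groups.items()}


# Made-up calls for three genes under three conditions.
calls = {
    "geneA": [1, 0, 1],
    "geneB": [1, 0, 1],
    "geneC": [0, 0, 1],
}
groups = group_by_pattern(calls)
for pattern, genes in groups.items():
    print(pattern, genes)
```

Here `geneA` and `geneB` fall into the same group because their calls agree in every condition, while `geneC` forms its own group. A model-based approach would instead infer such shared patterns from noisy counts.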