Publication

STATISTICAL METHODS TO STUDY TRANSPOSON SEQUENCING DATA: NONPARAMETRIC BAYESIAN MODELS WITH SAMPLING ALGORITHMS

Abstract
With the development of next-generation sequencing (NGS) technology, researchers can easily obtain data from millions of cells (bulk samples) or from a single cell. Bulk samples can capture broad changes, but they risk providing an average measurement that is not representative of the genetic state of any individual cell. Single-cell experiments can capture the genetic state of an individual cell, but a single-cell sample carries greater uncertainty, and sampling enough cells to obtain a representative sample of the population is expensive. Therefore, there is a need to integrate information from both bulk and single-cell data to obtain a comprehensive understanding of subclonal populations in an individual tumor as well as across individuals. The goal of this work is to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data. We propose a hierarchical Dirichlet process mixture model that incorporates the correlation structure induced by a structured sampling arrangement, and we show that this model improves the quality of inference. We develop a representation of the hierarchical Dirichlet process prior as a Gamma-Poisson hierarchy, and we use this representation to derive a fast Gibbs sampling inference algorithm based on the augment-and-marginalize method. Experiments show that our model outperforms state-of-the-art methods.
Another goal of genomic data analysis is to understand which genes are essential and under what environmental conditions they are essential. Transposon sequencing provides a powerful tool for finding conditionally essential genes. However, methods are needed that go beyond a one-at-a-time analysis of conditionally essential genes and learn higher-order representations that identify conditionally essential networks of genes.
While existing methods do identify essential genes from transposon sequencing data, they do not provide a representation of the space of essential genes. For example, if two genes share the same pattern of essentiality across all conditions, there is a higher-level representation that couples those genes into a network. The goal of this work is to build such a higher-level representation of the set of essential genes and to identify genes that share essentiality patterns across conditions. To address this need, we develop a novel, computationally efficient hierarchical nonparametric Bayesian model: the hierarchical Gamma-Poisson Process (hGP).
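As a rough illustration of why a Gamma-Poisson hierarchy can stand in for a Dirichlet-type prior, the sketch below relies on a standard fact: independent gamma variables with a common scale, once normalized, follow a Dirichlet distribution. This is a finite-dimensional toy and not the model or sampler developed in the dissertation; the function name `sample_gamma_poisson` and the values of `alpha` and `scale` are assumptions made only for this example.

```python
import numpy as np


def sample_gamma_poisson(alpha, scale, rng):
    """Draw one finite Gamma-Poisson sample (hypothetical helper).

    rates_k   ~ Gamma(alpha_k, scale), independently
    counts_k  ~ Poisson(rates_k)
    weights   = rates / sum(rates), which is Dirichlet(alpha)
    """
    alpha = np.asarray(alpha, dtype=float)
    rates = rng.gamma(shape=alpha, scale=scale)
    counts = rng.poisson(rates)
    weights = rates / rates.sum()
    return rates, counts, weights


rng = np.random.default_rng(0)
alpha = [2.0, 1.0, 0.5]  # assumed concentration parameters
scale = 10.0             # assumed common scale of the gamma rates

# Monte Carlo check: the normalized rates should average to alpha / sum(alpha),
# the mean of a Dirichlet(alpha) distribution.
draws = np.array(
    [sample_gamma_poisson(alpha, scale, rng)[2] for _ in range(20000)]
)
mean_weights = draws.mean(axis=0)
expected = np.asarray(alpha) / np.sum(alpha)
print("empirical mean weights:", mean_weights)
print("Dirichlet mean:        ", expected)
```

Working with unnormalized gamma rates and Poisson counts keeps the components independent, which is the kind of structure that tends to make Gibbs updates simple; how the dissertation exploits this is described in the full text, not in this sketch.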
Type
openaccess
article
dissertation