Date of Award


Document Type

Open Access Dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program


First Advisor

Andrea S. Foulkes

Second Advisor

Raji Balasubramanian

Third Advisor

Rongheng Lin

Subject Categories



Recent advances in technology that facilitate the acquisition of multi-parameter defined phenotypes have created new opportunities for predicting patient outcomes from individual-specific changes in cell subsets. The resulting data can be challenging to analyze, because predictors may be highly correlated with each other or related to the outcome only within levels of other predictor variables. As a result, traditional methods such as simple linear models, and univariate approaches such as odds ratios, may be insufficient. In this dissertation, we describe potential solutions, including tree-based methods, ridge regression, mixed modeling, and a new estimator, the mixed ridge estimator, which is fitted with an expectation-maximization (EM) algorithm. Data examples are provided, in particular from flow cytometry, a method of measuring a large number of particle counts at once by suspending the particles in a fluid and shining a beam of light onto the fluid. This method is particularly relevant to the study of human immunodeficiency virus (HIV), where there is great potential to draw on the rich array of data on host cell-mediated response to infection and drug exposure in order to discover patient-level determinants of disease progression and/or response to antiretroviral therapy (ART). The data sets collected are often high dimensional with correlated columns. In the first chapter of this manuscript, we demonstrate the application and comparative interpretation of three tree-based algorithms for the analysis of flow cytometry data. Specifically, we consider the question of what best predicts CD4 T-cell recovery in HIV-1 infected persons starting antiretroviral therapy with CD4 counts between 200 and 350 cells/μl.
The tree-based approaches, namely classification and regression trees (CART), random forests (RF), and logic regression (LR), were designed specifically to uncover complex structure in high-dimensional data settings. While contingency table analysis and RFs provide information on the importance of each potential predictor variable, CART and LR offer additional insight into the combinations of variables that together are predictive of the outcome. Specifically, application of tree-based methods to our data suggests that a combination of baseline immune activation states, with emphasis on CD8 T-cell activation, may be a better predictor than any single T-cell or innate cell subset analyzed. In the following chapter, the tree-based methods are compared to each other in a simulation study. Each has its merits in particular circumstances. For example, RF is able to identify the order of importance of predictors regardless of whether there is a tree-like structure. It can adjust for correlation among predictors because the algorithm analyzes subsets of predictors and subjects over a number of iterations. CART is useful when variables are predictive of the outcome within levels of other variables, and it uses pruning to arrive at a parsimonious model. LR also identifies structure within the set of predictor variables and clearly illustrates the relationships among variables. However, because of the vast number of combinations of predictor variables that would need to be analyzed to find the single best LR tree, the algorithm searches only a subset of the potential combinations of predictors. Results may therefore differ each time the algorithm is run on the same data set. Next, we take a regression approach to analyzing data with correlated predictors. Ridge regression accounts for correlated predictors by adding a shrinkage component to the estimators of a linear model.
We perform a simulation study comparing ridge regression to linear regression over a range of correlation coefficients and find that ridge regression increasingly outperforms linear regression as correlation grows. To handle collinearity among predictors in longitudinal data, we develop a new estimator that combines ridge regression and mixed models using an EM algorithm, and we compare it to the mixed model. A simulation study comparing our mixed ridge (MR) approach with a traditional mixed model shows that the MR estimator handles collinearity among predictor variables better than the mixed model, while accounting for the random within-subject effects that ordinary ridge regression ignores. As correlation among predictors increases, power decreases more quickly for the mixed model than for MR. Additionally, the type I error rate is not significantly elevated under the MR approach. The MR estimator gives new insight into flow cytometry data, and into other data sets with correlated predictor variables, that our tree-based methods could not provide. Together, these methods offer insight into our data that more traditional methods of analysis do not.
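
The comparison of ridge regression and linear regression under increasing predictor correlation can be illustrated with a minimal simulation sketch. This is not the dissertation's simulation design: the sample size, number of predictors, true coefficients, equicorrelation structure, penalty value `lam`, and the function name `simulate_mse` are all assumptions chosen for illustration only.

```python
import numpy as np


def simulate_mse(rho, n=50, p=5, lam=5.0, n_reps=200, seed=0):
    """Average squared estimation error of OLS and ridge coefficients.

    Hypothetical setup (not from the dissertation): p predictors with
    pairwise correlation rho, true coefficients all equal to 1, and
    standard normal noise.
    """
    rng = np.random.default_rng(seed)
    beta = np.ones(p)

    # Equicorrelation matrix: 1 on the diagonal, rho elsewhere.
    cov = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    chol = np.linalg.cholesky(cov)

    ols_err = 0.0
    ridge_err = 0.0
    for _ in range(n_reps):
        # Draw correlated predictors and a linear outcome with noise.
        X = rng.standard_normal((n, p)) @ chol.T
        y = X @ beta + rng.standard_normal(n)

        xtx = X.T @ X
        xty = X.T @ y

        # Ordinary least squares: solve (X'X) b = X'y.
        b_ols = np.linalg.solve(xtx, xty)
        # Ridge: add the shrinkage term lam * I to X'X before solving.
        b_ridge = np.linalg.solve(xtx + lam * np.eye(p), xty)

        ols_err += np.sum((b_ols - beta) ** 2)
        ridge_err += np.sum((b_ridge - beta) ** 2)

    return ols_err / n_reps, ridge_err / n_reps


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9, 0.95):
        ols, ridge = simulate_mse(rho)
        print(f"rho={rho:.2f}  OLS error={ols:.3f}  ridge error={ridge:.3f}")
```

Under these assumed settings, the estimation error of ordinary least squares should grow sharply as `rho` approaches 1, while the ridge estimate remains more stable. In practice the penalty would be chosen from the data (for example by cross-validation) rather than fixed in advance.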

Included in

Biostatistics Commons