In this thesis, we propose statistical models for data types and study designs commonly encountered in large epidemiologic investigations aimed at understanding the molecular basis of complex disorders. The motivating applications come from diverse disease areas in women's health, including type II diabetes in the Women's Health Initiative (WHI), invasive breast cancer in the Nurses' Health Study, and the metabolomic underpinnings of cardiovascular disease in the WHI. We have also put significant effort into making the proposed methods accessible through freely available, user-friendly software packages in R.
The first chapter is motivated by the self-reported outcomes of incident diabetes that were collected periodically for approximately 160,000 women enrolled in the WHI. While self-reported outcomes are cost-efficient, they are also subject to error. With the goal of variable selection in a high-dimensional data setting, we adapt the Random Survival Forests algorithm to accommodate the characteristics of error-prone self-reports. We propose a novel likelihood-based splitting rule and an associated variable selection algorithm to select the subset of relevant biomarkers that are associated with the time to the event of interest. We compare the proposed methods to existing approaches in simulation studies. We apply the proposed algorithm to discover single nucleotide polymorphisms (SNPs) associated with incident type II diabetes risk in a dataset of 909,622 SNPs on 10,832 African American and Hispanic women. We implement the proposed algorithm in the R package icRSF.
The second chapter is aimed at estimating and evaluating prediction rules in data generated in matched case-control studies that are nested within large prospective cohorts. This work is motivated by a matched case-control study nested within the Nurses' Health Study, where the goal was to determine whether including a set of seven endogenous hormone measurements improves the prediction of breast cancer risk relative to the previously published Gail score. For this setting, we propose an algorithm for estimating a summary index, the area under the receiver operating characteristic (ROC) curve (AUC), associated with a set of pre-defined covariates for predicting a binary outcome. By combining data from the parent cohort with data generated in the matched case-control study, we describe methods for estimating the population parameters of interest and the corresponding AUC. We evaluate the bias associated with the proposed methods in simulations over a range of parameter settings. We illustrate the methods in the motivating study of endogenous hormones and breast cancer risk, nested within the Nurses' Health Study.
The third chapter is aimed at estimating and evaluating prediction rules in high-dimensional datasets generated in matched case-control studies nested within large prospective cohorts. In this setting, the goals include simultaneous variable selection, estimation of a prediction rule, and estimation of a corresponding summary index, such as the AUC, that quantifies the strength of prediction. This work is motivated by an ongoing study of the metabolomics of cardiovascular disease in the WHI. Through extensive simulations, we compare three disparate variable selection procedures in conjunction with the parameter estimation and inverse probability weighted estimation of the AUC proposed in Chapter 2. We also evaluate the extent of overfitting observed when the multi-step procedure is carried out within one, two and three independent datasets.
The common thread underlying all three chapters of this thesis is the development and application of statistical models useful in the study of complex disorders, with illustrative applications drawn from diverse areas of women's health.
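As a rough illustration of the inverse probability weighting idea behind the AUC estimation mentioned for Chapters 2 and 3, the following Python sketch computes a weighted concordance estimate in which each sampled subject is weighted by the reciprocal of an assumed sampling probability. This is a generic weighted AUC, not the algorithm proposed in the thesis; the function name `weighted_auc`, its arguments, and the toy numbers are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def weighted_auc(
    scores: Sequence[float],
    labels: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Weighted concordance (AUC) over all case-control pairs.

    Assumption: weights[i] is the inverse of subject i's probability of
    being sampled from the parent cohort (hypothetical inputs, not the
    thesis's estimator). Ties in score count as half-concordant.
    """
    cases = [(s, w) for s, y, w in zip(scores, labels, weights) if y == 1]
    controls = [(s, w) for s, y, w in zip(scores, labels, weights) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    numerator = 0.0
    denominator = 0.0
    for case_score, case_weight in cases:
        for control_score, control_weight in controls:
            pair_weight = case_weight * control_weight
            denominator += pair_weight
            if case_score > control_score:
                numerator += pair_weight
            elif case_score == control_score:
                numerator += 0.5 * pair_weight
    return numerator / denominator


# Toy data: two cases sampled with probability 1, two controls sampled
# with probabilities 0.5 and 0.25 (all values are made up).
toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_labels = [1, 1, 0, 0]
toy_sampling_probs = [1.0, 1.0, 0.5, 0.25]
toy_weights = [1.0 / p for p in toy_sampling_probs]

ipw_estimate = weighted_auc(toy_scores, toy_labels, toy_weights)
naive_estimate = weighted_auc(toy_scores, toy_labels, [1.0] * 4)
```

In this toy example the weighted and unweighted estimates differ because the lower-scoring control stands in for more cohort members than the higher-scoring one, which is the kind of sampling imbalance that weighting is meant to correct.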