Open Access Dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Public Health

First Advisor

Laura B. Balzer

Second Advisor

Raji Balasubramanian

Third Advisor

Iván Díaz

Subject Categories

Biostatistics | Data Science | Environmental Public Health | Epidemiology


Many questions in public health and medicine are fundamentally causal in that our objective is to learn the effect of some exposure, randomized or not, on an outcome of interest. As a result, causal inference frameworks and methodologies have gained interest as promising tools for reliably answering scientific questions. However, identifying and efficiently estimating causal effects from observed data still poses significant challenges under complex data-generating scenarios. We focus on (1) high-dimensional settings, where the number of variables is orders of magnitude larger than the number of observations; and (2) multi-level settings, where study participants are grouped into clusters and the exposure is assigned at the cluster level.

First, we propose a novel adaptation of the Super Learner algorithm for the task of feature selection in high-dimensional settings. In simulations and with real data, we demonstrate that our proposed approach improves accuracy in identifying potential causes of a target variable by using a novel measure of variable importance and by combining a library of feature selection algorithms.

Second, we consider the task of estimating ‘biological age’ from a set of age-dependent, potentially high-dimensional variables (e.g., -omics). We propose a new method for calculating biological age that is based on an adaptation of the algorithm presented in Chapter 2. Then, we develop an approach to evaluate, compare, and combine different approaches to biological age estimation, with the goal of constructing age-related disease risk scores that could aid in the diagnosis and prognosis of age-related diseases.

Third, we turn our attention to causal mediation analysis in a multi-level setting where the exposure is assigned at the cluster level, but the mediator and outcomes are measured at the participant level. We extend the general hierarchical causal model to include mediating variables. We adapt the mediation effects that arise from the population intervention effect (PIE) via stochastic interventions on the exposure to the multi-level setting.


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.