Off-campus UMass Amherst users: To download campus access dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this dissertation through interlibrary loan.

Dissertations that have an embargo placed on them will not be available to anyone until the embargo expires.

Author ORCID Identifier


Open Access Dissertation

Document Type


Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded


Month Degree Awarded


First Advisor

David Jensen

Second Advisor

Ben Marlin

Third Advisor

Madalina Fiterau

Fourth Advisor

Justin Gross

Subject Categories

Computer Sciences


Causal modeling is central to many areas of artificial intelligence, including complex reasoning, planning, knowledge-base construction, robotics, explanation, and fairness. Active communities of researchers in machine learning, statistics, social science, and other fields develop and enhance algorithms that learn causal models from data, and this work has produced a series of impressive technical advances. However, evaluation techniques for causal modeling algorithms have remained somewhat primitive, limiting what we can learn from the experimental studies of algorithm performance, constraining the types of algorithms and model representations that researchers consider, and creating a gap between theory and practice. We argue for expanding the standard techniques for evaluating algorithms that construct causal models. Specifically, we argue for the addition of evaluation techniques that use interventional measures rather than structural or observational measures, and that evaluate with those measures on empirical data rather than synthetic data. We survey the current practice in evaluation and show that, while the evaluation techniques we advocate are rarely used in practice, they are feasible and produce substantially different results than using structural measures and synthetic data. We also provide a protocol for generating observational-style data sets from experimental data, allowing the creation of a large number of data sets suitable for evaluation of causal modeling algorithms. We then perform a large-scale evaluation of seven causal modeling methods over 37 data sets, drawn from randomized controlled trials, as well as simulators, real-world computational systems, and observational data sets augmented with a synthetic response variable. We find notable performance differences when comparing across data from different sources. This difference demonstrates the importance of using data from a variety of sources when evaluating any causal modeling methods.