Off-campus UMass Amherst users: To download campus access dissertations, please use the following link to log into our proxy server with your UMass Amherst user name and password.

Non-UMass Amherst users: Please talk to your librarian about requesting this dissertation through interlibrary loan.

Dissertations that have an embargo placed on them will not be available to anyone until the embargo expires.

Document Type

Campus-Only Access for Five (5) Years

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2015

Month Degree Awarded

September

First Advisor

Andrew G. Barto

Subject Categories

Artificial Intelligence and Robotics

Abstract

This dissertation proposes and presents solutions to two new problems that fall within the broad scope of reinforcement learning (RL) research. The first problem, high confidence off-policy evaluation (HCOPE), requires an algorithm to use historical data from one or more behavior policies to compute a high confidence lower bound on the performance of an evaluation policy. This allows us to, for the first time, provide the user of any RL algorithm with confidence that a newly proposed policy (which has never actually been used) will perform well.

The second problem is to construct what we call a safe reinforcement learning algorithm---an algorithm that searches for new and improved policies, while ensuring that the probability that a "bad" policy is proposed is low. Importantly, the user of the RL algorithm may tune the meaning of "bad" (in terms of a desired performance baseline) and how low the probability of a bad policy being deployed should be, in order to capture the level of risk that is acceptable for the application at hand.

We show empirically that our solutions to these two critical problems require surprisingly little data, making them practical for real problems. While our methods allow us to, for the first time, produce convincing statistical guarantees about the performance of a policy without requiring its execution, the primary contribution of this dissertation is not the methods that we propose. The primary contribution of this dissertation is a compelling argument that these two problems, HCOPE and safe reinforcement learning, which at first may seem out of reach, are actually tractable. We hope that this will inspire researchers to propose their own methods, which improve upon our own, and that the development of increasingly data-efficient safe reinforcement learning algorithms will catalyze the widespread adoption of reinforcement learning algorithms for suitable real-world problems.

Share

COinS