Abstract
This dissertation proposes and presents solutions to two new problems that fall within the broad scope of reinforcement learning (RL) research. The first problem, high confidence off-policy evaluation (HCOPE), requires an algorithm to use historical data from one or more behavior policies to compute a high confidence lower bound on the performance of an evaluation policy. This allows us to, for the first time, provide the user of any RL algorithm with confidence that a newly proposed policy (which has never actually been used) will perform well. The second problem is to construct what we call a safe reinforcement learning algorithm---an algorithm that searches for new and improved policies, while ensuring that the probability that a "bad" policy is proposed is low. Importantly, the user of the RL algorithm may tune the meaning of "bad" (in terms of a desired performance baseline) and how low the probability of a bad policy being deployed should be, in order to capture the level of risk that is acceptable for the application at hand. We show empirically that our solutions to these two critical problems require surprisingly little data, making them practical for real problems. While our methods allow us to, for the first time, produce convincing statistical guarantees about the performance of a policy without requiring its execution, the primary contribution of this dissertation is not the methods that we propose. The primary contribution of this dissertation is a compelling argument that these two problems, HCOPE and safe reinforcement learning, which at first may seem out of reach, are actually tractable. We hope that this will inspire researchers to propose their own methods, which improve upon our own, and that the development of increasingly data-efficient safe reinforcement learning algorithms will catalyze the widespread adoption of reinforcement learning algorithms for suitable real-world problems.
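The dissertation's actual HCOPE methods are not reproduced here. Purely as an illustration of the problem setup, the sketch below combines per-trajectory importance sampling with weight clipping and Hoeffding's inequality to produce a 1 - delta confidence lower bound on the evaluation policy's performance from behavior-policy data. The function name, signature, and clipping threshold are hypothetical; it assumes each trajectory is a list of (state, action, reward) tuples whose summed reward is normalized to [0, 1], and that pi_e(a, s) and pi_b(a, s) return action probabilities.

import numpy as np

def hcope_lower_bound(trajectories, pi_e, pi_b, delta=0.05, clip=10.0):
    # Per-trajectory importance-sampling estimates of pi_e's expected return,
    # computed from data generated by the behavior policy pi_b.
    estimates = []
    for trajectory in trajectories:
        weight, ret = 1.0, 0.0
        for state, action, reward in trajectory:
            weight *= pi_e(action, state) / pi_b(action, state)
            ret += reward
        # Clipping the importance weight can only bias the estimate downward
        # (returns are assumed nonnegative), so the bound stays valid while
        # the estimator's range, and hence the bound's width, is controlled.
        estimates.append(min(weight, clip) * ret)
    estimates = np.asarray(estimates)
    n = len(estimates)
    # Hoeffding's inequality for values in [0, clip]: with probability at
    # least 1 - delta, the clipped estimator's true mean (itself a lower
    # bound on pi_e's performance) is at least this value.
    return estimates.mean() - clip * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

In the safe reinforcement learning setting described above, a user would compare such a bound against the chosen performance baseline and deploy the proposed policy only when the bound exceeds it; the dissertation's contribution is in making bounds of this kind tight enough to be practical with realistic amounts of data.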
Type
Dissertation (Open Access)
Date
2015-09