Abstract
Reinforcement learning (RL) has emerged as a general-purpose technique for addressing problems involving sequential decision-making. However, most RL methods are based upon the fundamental assumption that the transition dynamics and reward functions are fixed, that is, that the underlying Markov decision process is stationary. This limits the applicability of such RL methods because real-world problems are often subject to changes due to external factors (\textit{passive} non-stationarity), changes induced by interactions with the system itself (\textit{active} non-stationarity), or both (\textit{hybrid} non-stationarity). For example, personalized automated healthcare systems and other automated human-computer interaction systems need to constantly account for changes in human behavior and interests that occur over time. Further, when the stakes associated with financial risks or human life are high, the cost associated with a false stationarity assumption may be unacceptable. In this work, we address several challenges underlying (off-policy) policy evaluation, improvement, and safety amidst such non-stationarities. Our approach merges ideas from reinforcement learning, counterfactual reasoning, and time-series analysis. When the stationarity assumption is violated, using existing algorithms may result in a performance lag and false safety guarantees. This raises the question: how can we use historical data to optimize for future scenarios? To address these challenges in the presence of \textit{passive} non-stationarity, we show how future performance of a policy can be \textit{evaluated} using a forecast obtained by fitting a curve to counterfactual estimates of policy performance over time, without ever directly modeling the underlying non-stationarity. We show that this approach further enables policy \textit{improvement} to proactively search for a good future policy by leveraging a policy gradient algorithm that maximizes a forecast of future performance. Building upon these advances, we present a Seldonian algorithm that provides the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems. The presence of \textit{active} and \textit{hybrid} non-stationarity poses additional challenges by exposing a completely new feedback loop that allows an agent to potentially control the non-stationary aspects of the environment. This makes the outcomes of future decisions dependent on all of the past interactions, effectively resulting in a \textit{single} lifelong sequence of decisions. We propose a method that provides the first steps towards a general procedure for on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity.
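
As a rough illustration of the forecasting idea described in the abstract, the following minimal sketch treats each past episode's counterfactual (importance-sampling) estimate of the evaluation policy's return as a point in a noisy time series, fits a curve to those points, and extrapolates one step ahead to forecast future performance. The function names, the episode representation, and the polynomial basis are illustrative assumptions, not the dissertation's exact algorithm.

    # Minimal sketch (assumed names and data layout): forecast a policy's
    # future performance under passive non-stationarity by extrapolating
    # a curve fit to per-episode importance-sampling estimates.
    import numpy as np

    def is_return(episode, eval_policy_prob):
        """Per-episode importance-sampling estimate of the evaluation policy's return.

        `episode` is a list of (state, action, reward, behavior_prob) tuples.
        """
        rho = 1.0
        ret = 0.0
        for state, action, reward, behavior_prob in episode:
            rho *= eval_policy_prob(state, action) / behavior_prob
            ret += reward
        return rho * ret

    def forecast_future_performance(episodes, eval_policy_prob, degree=1):
        """Fit a polynomial trend to past counterfactual estimates and extrapolate."""
        estimates = np.array([is_return(ep, eval_policy_prob) for ep in episodes])
        times = np.arange(len(estimates))
        coeffs = np.polyfit(times, estimates, deg=degree)  # least-squares curve fit
        return np.polyval(coeffs, len(estimates))           # forecast for the next episode

The policy-improvement step described in the abstract would then, loosely, take gradient steps on the policy parameters so as to increase such a forecast rather than the average of past performance.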
Type
dissertation
Date
2022-09
Publisher
Degree
License
http://creativecommons.org/licenses/by/4.0/