Author ORCID Identifier
https://orcid.org/0009-0007-2868-4204
Access Type
Open Access Dissertation
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Degree Program
Computer Science
Year Degree Awarded
2024
Month Degree Awarded
February
First Advisor
Philip S. Thomas
Second Advisor
Bruno Castro da Silva
Third Advisor
Scott Niekum
Fourth Advisor
Weibo Gong
Subject Categories
Artificial Intelligence and Robotics
Abstract
Policy gradient methods are a class of reinforcement learning algorithms that optimize a parametric policy by maximizing an objective function that directly measures the performance of the policy. Despite being used in many high-profile applications of reinforcement learning, the conventional use of policy gradient methods in practice deviates from existing theory. This thesis presents a comprehensive mathematical analysis of policy gradient methods, uncovering misconceptions and suggesting novel solutions to improve their performance. We first demonstrate that the update rule used by most policy gradient methods does not correspond to the gradient of any objective function due to the way the discount factor is applied, leading to suboptimal convergence. Subsequently, we show that even when this is taken into account, existing policy gradient algorithms are suboptimal in that they fail to eliminate several sources of variance. To address the first issue, we show that by gradually increasing the discount factor at a particular rate, we can restore the optimal convergence of policy gradient methods. To further address the issue of high variance, we propose a new value function called the posterior value function. This function leverages additional information from later in trajectories that was previously thought to introduce bias. With this function, we construct a new stochastic estimator that eliminates several sources of variance present in most policy gradient methods.
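The abstract's first claim, that the common update rule drops a discount-dependent weight and therefore is not the gradient of the discounted objective, can be illustrated with a minimal sketch. This is not code from the dissertation: the function names, the scalar stand-ins for the score function, and the toy trajectory are all illustrative assumptions. The gradient of the discounted objective weights each step's term by an extra factor of gamma^t; most implementations omit that factor.

```python
def discounted_returns(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backward over the trajectory."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def discounted_objective_gradient(grad_logps, rewards, gamma):
    # Gradient of the discounted objective: each step's term carries a gamma^t weight.
    Gs = discounted_returns(rewards, gamma)
    return sum((gamma ** t) * g * G
               for t, (g, G) in enumerate(zip(grad_logps, Gs)))

def common_update(grad_logps, rewards, gamma):
    # The update used by most implementations: same terms, but the gamma^t weight is dropped.
    Gs = discounted_returns(rewards, gamma)
    return sum(g * G for g, G in zip(grad_logps, Gs))

# Toy single-trajectory example (scalar stand-ins for the gradient of log pi(a_t | s_t)).
grad_logps = [0.5, -0.2, 0.1]
rewards = [1.0, 0.0, 2.0]
gamma = 0.9

print(discounted_objective_gradient(grad_logps, rewards, gamma))  # 1.148
print(common_update(grad_logps, rewards, gamma))                  # 1.15
```

The two estimates differ, and no choice of objective recovers the common update for all trajectories; the dissertation's proposed remedy of gradually increasing the discount factor closes this gap in the limit.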
DOI
https://doi.org/10.7275/36300305
Recommended Citation
Nota, Christopher P., "Policy Gradient Methods: Analysis, Misconceptions, and Improvements" (2024). Doctoral Dissertations. 3075.
https://doi.org/10.7275/36300305
https://scholarworks.umass.edu/dissertations_2/3075
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.