Author ORCID Identifier

https://orcid.org/0009-0007-2868-4204

Access Type

Open Access Dissertation

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2024

Month Degree Awarded

February

First Advisor

Philip S. Thomas

Second Advisor

Bruno Castro da Silva

Third Advisor

Scott Niekum

Fourth Advisor

Weibo Gong

Subject Categories

Artificial Intelligence and Robotics

Abstract

Policy gradient methods are a class of reinforcement learning algorithms that optimize a parametric policy by maximizing an objective function that directly measures the performance of the policy. Although policy gradient methods are used in many high-profile applications of reinforcement learning, the way they are conventionally applied in practice deviates from existing theory. This thesis presents a comprehensive mathematical analysis of policy gradient methods, uncovering misconceptions and suggesting novel solutions to improve their performance. We first demonstrate that the update rule used by most policy gradient methods does not correspond to the gradient of any objective function, due to the way the discount factor is applied, leading to suboptimal convergence. We then show that even when this is taken into account, existing policy gradient algorithms remain suboptimal in that they fail to eliminate several sources of variance. To address the first issue, we show that gradually increasing the discount factor at a particular rate restores the optimal convergence of policy gradient methods. To further address the issue of high variance, we propose a new value function, called the posterior value function, which leverages additional information from later in trajectories that was previously thought to introduce bias. With this function, we construct a new stochastic estimator that eliminates several sources of variance present in most policy gradient methods.
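
The following is a minimal illustrative sketch, not taken from the dissertation, of the discrepancy described in the abstract. It assumes a tabular softmax policy and a single sampled trajectory; the names discounted_returns, grad_log_softmax, true_gradient_estimate, and common_update_direction are hypothetical helpers introduced here. It contrasts an estimator of the gradient of the discounted objective, in which each term carries a gamma^t weight, with the update most implementations use, which drops that weight and therefore is not the gradient of the discounted objective.

import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k for every time step t."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def grad_log_softmax(theta, s, a):
    """Gradient of log pi(a | s) for a tabular softmax policy with parameters theta[s, :]."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[s] = -probs          # d/dtheta[s, b] log pi(a | s) = 1{b == a} - pi(b | s)
    g[s, a] += 1.0
    return g

def true_gradient_estimate(theta, states, actions, rewards, gamma):
    """Unbiased estimate of the gradient of the discounted objective:
    each term is weighted by gamma^t in addition to the return G_t."""
    G = discounted_returns(rewards, gamma)
    return sum((gamma ** t) * G[t] * grad_log_softmax(theta, s, a)
               for t, (s, a) in enumerate(zip(states, actions)))

def common_update_direction(theta, states, actions, rewards, gamma):
    """The update rule most implementations use: the gamma^t weight is dropped,
    so this direction is not the gradient of the discounted objective."""
    G = discounted_returns(rewards, gamma)
    return sum(G[t] * grad_log_softmax(theta, s, a)
               for t, (s, a) in enumerate(zip(states, actions)))

# Toy usage on one hypothetical trajectory in a 3-state, 2-action problem.
theta = np.zeros((3, 2))
states, actions, rewards = [0, 1, 2], [1, 0, 1], [0.0, 0.0, 1.0]
print(true_gradient_estimate(theta, states, actions, rewards, gamma=0.9))
print(common_update_direction(theta, states, actions, rewards, gamma=0.9))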

DOI

https://doi.org/10.7275/36300305

Creative Commons License

Creative Commons Attribution 4.0 License
