Author ORCID Identifier

https://orcid.org/0009-0007-2868-4204

Access Type

Open Access Dissertation

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Degree Program

Computer Science

Year Degree Awarded

2024

Month Degree Awarded

February

First Advisor

Philip S. Thomas

Second Advisor

Bruno Castro da Silva

Third Advisor

Scott Niekum

Fourth Advisor

Weibo Gong

Subject Categories

Artificial Intelligence and Robotics

Abstract

Policy gradient methods are a class of reinforcement learning algorithms that optimize a parametric policy by maximizing an objective function that directly measures the performance of the policy. Although policy gradient methods are used in many high-profile applications of reinforcement learning, the way they are conventionally applied in practice deviates from existing theory. This thesis presents a comprehensive mathematical analysis of policy gradient methods, uncovering misconceptions and suggesting novel solutions to improve their performance. We first demonstrate that the update rule used by most policy gradient methods does not correspond to the gradient of any objective function, due to the way the discount factor is applied, leading to suboptimal convergence. We then show that even when this is taken into account, existing policy gradient algorithms remain suboptimal in that they fail to eliminate several sources of variance. To address the first issue, we show that gradually increasing the discount factor at a particular rate restores the optimal convergence of policy gradient methods. To further address the issue of high variance, we propose a new value function, called the posterior value function, which leverages additional information from later in trajectories that was previously thought to introduce bias. With this function, we construct a new stochastic estimator that eliminates several sources of variance present in most policy gradient methods.
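
The following is a minimal illustrative sketch, not taken from the dissertation, of the discrepancy described in the abstract. It assumes a tabular softmax policy and a single sampled trajectory; the names discounted_returns, grad_log_softmax, true_gradient_estimate, and common_update_direction are hypothetical helpers introduced here. It contrasts an estimator of the gradient of the discounted objective, in which each term carries a gamma^t weight, with the update most implementations use, which drops that weight and therefore is not the gradient of the discounted objective.

import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k for every time step t."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def grad_log_softmax(theta, s, a):
    """Gradient of log pi(a | s) for a tabular softmax policy with parameters theta[s, :]."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[s] = -probs          # d/dtheta[s, b] log pi(a | s) = 1{b == a} - pi(b | s)
    g[s, a] += 1.0
    return g

def true_gradient_estimate(theta, states, actions, rewards, gamma):
    """Unbiased estimate of the gradient of the discounted objective:
    each term is weighted by gamma^t in addition to the return G_t."""
    G = discounted_returns(rewards, gamma)
    return sum((gamma ** t) * G[t] * grad_log_softmax(theta, s, a)
               for t, (s, a) in enumerate(zip(states, actions)))

def common_update_direction(theta, states, actions, rewards, gamma):
    """The update rule most implementations use: the gamma^t weight is dropped,
    so this direction is not the gradient of the discounted objective."""
    G = discounted_returns(rewards, gamma)
    return sum(G[t] * grad_log_softmax(theta, s, a)
               for t, (s, a) in enumerate(zip(states, actions)))

# Toy usage on one hypothetical trajectory in a 3-state, 2-action problem.
theta = np.zeros((3, 2))
states, actions, rewards = [0, 1, 2], [1, 0, 1], [0.0, 0.0, 1.0]
print(true_gradient_estimate(theta, states, actions, rewards, gamma=0.9))
print(common_update_direction(theta, states, actions, rewards, gamma=0.9))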

DOI

https://doi.org/10.7275/36300305

Creative Commons License

Creative Commons Attribution 4.0 License
