Hierarchical average reward reinforcement learning

Publication Date

2007

Journal or Book Title

Journal of Machine Learning Research

Abstract

Hierarchical reinforcement learning (HRL) is the study of mechanisms for exploiting the structure of tasks in order to learn more quickly. By decomposing tasks into subtasks, fully or partially specified subtask solutions can be reused in solving tasks at higher levels of abstraction. The theory of semi-Markov decision processes (SMDPs) provides a theoretical basis for HRL. Several variant representational schemes based on SMDP models have been studied in previous work, all of which rely on the discrete-time discounted SMDP model; in that setting, policies are learned that maximize the long-term discounted sum of rewards. In this paper we investigate two formulations of HRL based on the average-reward SMDP model, for both discrete time and continuous time. In the average-reward model, policies are sought that maximize the expected reward per step. The two formulations correspond to two different notions of optimality that have been explored in previous work on HRL: hierarchical optimality, which corresponds to the set of optimal policies in the space defined by a task hierarchy, and a weaker, local notion called recursive optimality. What distinguishes the two formulations in the average-reward framework is how subtasks are optimized. In the recursively optimal formulation, subtasks are treated as continuing tasks and solved by finding gain-optimal policies given the policies of their children. In the hierarchically optimal formulation, the aim is to find a globally gain-optimal policy within the space of policies defined by the hierarchical decomposition. We present algorithms that learn recursively and hierarchically optimal policies under discrete-time and continuous-time average-reward SMDP models. We use four experimental testbeds to study the empirical performance of our proposed algorithms. The first two domains are relatively simple: a small automated guided vehicle (AGV) scheduling problem and a modified version of the well-known Taxi problem. The other two domains are larger, real-world single-agent and multiagent AGV scheduling problems. We model these AGV scheduling tasks using both discrete-time and continuous-time models, and we compare the performance of our proposed algorithms with each other, with other HRL methods, and with standard Q-learning. In the large AGV domain, we also show that our proposed algorithms outperform widely used industrial heuristics such as “first come first serve”, “highest queue first”, and “nearest station first”.
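For readers unfamiliar with the average-reward criterion referred to above, the following is the standard textbook definition of the gain of a stationary policy \pi; the notation is generic and is not quoted from the paper itself. In the discrete-time case the gain is the long-run reward per step,

g^{\pi}(s) = \lim_{N \to \infty} \frac{1}{N} \, \mathbb{E}\left[ \sum_{t=0}^{N-1} r_t \;\middle|\; s_0 = s, \pi \right],

while in the continuous-time SMDP case the normalization is by expected elapsed time rather than by the number of decision epochs, with \tau_i denoting the sojourn time of the i-th epoch:

g^{\pi}(s) = \lim_{N \to \infty} \frac{\mathbb{E}\left[ \sum_{i=0}^{N-1} r_i \;\middle|\; s_0 = s, \pi \right]}{\mathbb{E}\left[ \sum_{i=0}^{N-1} \tau_i \;\middle|\; s_0 = s, \pi \right]}.

A gain-optimal policy maximizes g^{\pi} over all policies. In the terminology of the abstract, a hierarchically optimal policy maximizes the global gain within the space of policies consistent with the task hierarchy, whereas a recursively optimal policy makes each subtask gain-optimal given fixed policies for its children.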

Pages

2629-2669

Volume

8
