Reinforcement Learning with Non-Cumulative Objective

Abstract

In reinforcement learning, the objective is almost always defined as a cumulative function of the rewards along the process. However, there are many optimal control and reinforcement learning problems in various application fields, especially in communications and networking, where the objectives are not naturally expressed as summations of the rewards. In this paper, we recognize the prevalence of non-cumulative objectives in various problems and propose a modification to existing algorithms for optimizing such objectives. Specifically, we focus on the fundamental building block of many optimal control and reinforcement learning algorithms: the Bellman optimality equation. To optimize a non-cumulative objective, we replace the original summation operation in the Bellman update rule with a generalized operation corresponding to the objective. Furthermore, we provide sufficient conditions on the form of the generalized operation, as well as assumptions on the Markov decision process, under which the generalized Bellman updates are guaranteed to converge to the global optimum. We demonstrate the idea experimentally with the bottleneck objective, i.e., an objective determined by the minimum reward along the process, on classical optimal control and reinforcement learning tasks, as well as on two network routing problems of maximizing flow rates.
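To make the idea of swapping the summation for a generalized operation concrete, here is a minimal sketch, not the paper's algorithm. It assumes a small, deterministic, undiscounted MDP with a single terminal state, and uses the update V(s) = max_a g(r(s, a), V(s')). With g = + this is the ordinary Bellman optimality update. With g = min it optimizes the bottleneck objective (the best achievable minimum reward along the path). The graph, the function names, and the choice to initialize the terminal value to the identity element of g (0 for +, +inf for min) are all illustrative assumptions.

```python
import math
import operator
from typing import Callable, Dict, List, Tuple

# Hypothetical deterministic MDP: state -> list of (next_state, reward) actions.
Graph = Dict[int, List[Tuple[int, float]]]
Op = Callable[[float, float], float]

GRAPH: Graph = {
    0: [(1, 5.0), (2, 3.0)],  # from state 0: go to 1 (reward 5) or to 2 (reward 3)
    1: [(3, 1.0)],
    2: [(3, 2.0)],
}
TERMINAL = 3


def generalized_value_iteration(graph: Graph, terminal: int, op: Op,
                                identity: float, n_iters: int = 50) -> Dict[int, float]:
    """Iterate V(s) = max_a op(r(s, a), V(s')) until the values stop changing.

    op=operator.add recovers the standard (undiscounted) Bellman update;
    op=min gives the bottleneck (max-min) objective. The terminal value is
    set to the identity element of op, so it does not affect the result.
    """
    states = set(graph) | {terminal}
    values = {s: -math.inf for s in states}
    values[terminal] = identity
    for _ in range(n_iters):
        new_values = dict(values)
        for s, actions in graph.items():
            new_values[s] = max(op(r, values[nxt]) for nxt, r in actions)
        if new_values == values:  # fixed point reached
            break
        values = new_values
    return values


def greedy_action(graph: Graph, values: Dict[int, float], state: int, op: Op) -> int:
    """Return the next state chosen greedily under the generalized update."""
    return max(graph[state], key=lambda a: op(a[1], values[a[0]]))[0]


sum_values = generalized_value_iteration(GRAPH, TERMINAL, operator.add, 0.0)
min_values = generalized_value_iteration(GRAPH, TERMINAL, min, math.inf)
print(sum_values[0], greedy_action(GRAPH, sum_values, 0, operator.add))  # 6.0 1
print(min_values[0], greedy_action(GRAPH, min_values, 0, min))           # 2.0 2
```

The two objectives disagree on this toy graph. The cumulative objective prefers the path 0 → 1 → 3 (total reward 6), while the bottleneck objective prefers 0 → 2 → 3, whose weakest link (2) is better than that of the other path (1). This mirrors routing a flow along the path with the largest minimum link capacity.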
