A Dual Perspective of Reinforcement Learning for Imposing Policy Constraints

Abstract

Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we are able to introduce some novel types of constraints, allowing us to impose bounds on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints that are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
