Offline Reinforcement Learning with Behavioral Supervisor Tuning

Abstract

Offline reinforcement learning (RL) algorithms are applied to learn performant, well-generalizing policies when provided with a static dataset of interactions. Many recent approaches to offline RL have seen substantial success, but with one key caveat: they demand substantial per-dataset hyperparameter tuning to achieve reported performance, which requires policy rollouts in the environment to evaluate; this can rapidly become cumbersome. Furthermore, substantial tuning requirements can hamper the adoption of these algorithms in practical domains. In this paper, we present TD3 with Behavioral Supervisor Tuning (TD3-BST), an algorithm that trains an uncertainty model and uses it to guide the policy to select actions within the dataset support. TD3-BST can learn more effective policies from offline datasets compared to previous methods and achieves the best performance across challenging benchmarks without requiring per-dataset tuning.
