Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning

Abstract

Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address this, existing uncertainty-based methods penalize the value function with uncertainty quantification and demand numerous ensemble networks, which poses computational challenges and leads to suboptimal outcomes. In this paper, we introduce a novel strategy employing diverse randomized value functions to estimate the posterior distribution of $Q$-values. It provides robust uncertainty quantification and estimates lower confidence bounds (LCB) of $Q$-values. By applying moderate value penalties for OOD actions, our method fosters a provably pessimistic approach. We also emphasize diversity within randomized value functions and enhance efficiency by introducing a diversity regularization method, reducing the requisite number of networks. These modules lead to reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB-penalty under linear MDP assumptions. Extensive empirical results also demonstrate that our proposed method significantly outperforms baseline methods in terms of performance and parametric efficiency.
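
As a rough illustration of how a lower confidence bound can be read off a set of sampled $Q$-values, the sketch below subtracts a multiple of the sample standard deviation from the sample mean for each action. This is a generic mean-minus-deviation estimate, not the paper's exact penalty: the function name `lcb_q_values`, the coefficient `beta`, and the numbers are all hypothetical and chosen only for illustration.

```python
from statistics import fmean, pstdev


def lcb_q_values(q_samples, beta=1.0):
    """Per-action lower confidence bound from sampled Q-values.

    q_samples: list of rows, one row per sampled value function,
               each row holding one Q-value per action (assumed layout).
    beta:      hypothetical pessimism coefficient scaling the penalty.
    """
    # Transpose so that each tuple holds all samples for one action.
    per_action = list(zip(*q_samples))
    # LCB = sample mean minus beta times the population standard deviation.
    return [fmean(qs) - beta * pstdev(qs) for qs in per_action]


# Hypothetical Q-values from three sampled value functions for two actions.
# Action 0 is agreed upon by all samples; action 1 has a higher mean but
# the samples disagree strongly, as one might expect for an OOD action.
q_samples = [
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, 0.0],
]

lcb = lcb_q_values(q_samples, beta=1.0)
best_action = max(range(len(lcb)), key=lambda a: lcb[a])
```

In this toy example the second action has the larger mean, yet its larger spread pulls its bound below that of the first action, so a policy acting greedily on the bound prefers the action the samples agree on.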
