DOMAIN: MilDly COnservative Model-BAsed OfflINe Reinforcement Learning

Abstract

Model-based reinforcement learning (RL), which learns an environment model from an offline dataset and generates additional out-of-distribution model data, has become an effective approach to the problem of distribution shift in offline RL. Because of the gap between the learned and the actual environment, conservatism should be incorporated into the algorithm to balance accurate offline data against imprecise model data. The conservatism of current algorithms mostly relies on model uncertainty estimation. However, uncertainty estimation is unreliable and leads to poor performance in certain scenarios, and previous methods ignore differences among the model data, which leads to excessive conservatism. This paper therefore proposes a milDly cOnservative Model-bAsed offlINe RL algorithm (DOMAIN) that addresses these issues without estimating model uncertainty. DOMAIN introduces an adaptive sampling distribution over model samples, which adaptively adjusts the penalty on model data. We theoretically demonstrate that the Q value learned by DOMAIN outside the region is a lower bound of the true Q value, that DOMAIN is less conservative than previous model-based offline RL algorithms, and that it has a safe policy improvement guarantee. Extensive experiments show that DOMAIN outperforms prior RL algorithms on the D4RL dataset benchmark and achieves better performance than other RL algorithms on tasks that require generalization.
