Percentile Criterion Optimization in Offline Reinforcement Learning

Abstract

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the \emph{percentile criterion}. The percentile criterion is approximately solved by constructing an \emph{ambiguity set} that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses \emph{Bayesian credible regions} as ambiguity sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.
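
To make the percentile criterion concrete, the sketch below evaluates it for one fixed policy from a handful of sampled returns. It is only an illustration, not the algorithm proposed in the paper: the function name `percentile_return`, the confidence parameter `delta`, and the sample values are all assumptions introduced here. The idea is that a return level counts as guaranteed if at least a (1 - delta) fraction of the sampled models achieve it, which is the empirical Value-at-Risk of the returns.

```python
import math


def percentile_return(sampled_returns, delta):
    """Largest y such that at least a (1 - delta) fraction of samples are >= y.

    Hypothetical helper: the empirical lower delta-quantile (Value-at-Risk)
    of the returns of a single fixed policy under sampled models.
    """
    if not sampled_returns:
        raise ValueError("need at least one sampled return")
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    ordered = sorted(sampled_returns)
    # 0-based position of the empirical lower delta-quantile.
    k = min(math.floor(delta * len(ordered)), len(ordered) - 1)
    return ordered[k]


# Made-up returns of one policy under 8 models drawn from a posterior.
returns = [1.0, 5.0, 2.0, 8.0, 3.0, 9.0, 4.0, 7.0]

var_25 = percentile_return(returns, delta=0.25)  # level met by 75% of samples
worst = min(returns)                             # worst case over all samples
```

With these made-up numbers, six of the eight samples reach at least 3.0, so the 75% guarantee is 3.0, while the worst case over all samples is 1.0. The gap between the two is the kind of conservatism the abstract attributes to optimizing against an overly large ambiguity set.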
