Dataset Reset Policy Optimization for RLHF

Abstract

Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.
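To make the dataset-reset idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a "state" is a prompt plus a partial response, and that a reset state is obtained by cutting a labeler-preferred response at a uniformly random position. The names `sample_reset_state`, `rollout_from_reset`, and the toy `policy` callable are hypothetical and introduced only for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]
# Assumed data layout: each offline example is (prompt tokens, preferred response tokens).
Example = Tuple[Tokens, Tokens]


def sample_reset_state(dataset: Sequence[Example], rng: random.Random) -> Tuple[Tokens, Tokens]:
    """Pick an offline example and return (prompt, prefix of the preferred response).

    The uniform cut point is an assumption of this sketch. A cut of 0 yields an
    empty prefix, which recovers the usual rollout from the prompt alone.
    """
    prompt, preferred = dataset[rng.randrange(len(dataset))]
    cut = rng.randint(0, len(preferred))  # randint is inclusive on both ends
    return prompt, preferred[:cut]


def rollout_from_reset(
    prompt: Tokens,
    prefix: Tokens,
    policy: Callable[[Tokens, Tokens], str],
    max_len: int,
) -> Tokens:
    """Continue generation from the reset state using the current policy."""
    response = list(prefix)
    while len(response) < max_len:
        token = policy(prompt, response)
        if token == "<eos>":
            break
        response.append(token)
    return response
```

In this sketch, the completed response would then be scored by the learned reward model and passed to the policy optimizer in place of a rollout started from the initial state distribution.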
