DPO Meets PPO: Reinforced Token Optimization for RLHF

Abstract

In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive real-world alignment experiments verify the effectiveness of the proposed approach.
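
To make the token-wise characterization concrete, the following minimal sketch shows one way a DPO-trained policy could yield per-token rewards. It assumes the standard DPO implicit reward, in which the sentence-level reward is (up to a prompt-dependent constant) $\beta$ times the log-ratio between the DPO policy and the reference policy, and splits that log-ratio across tokens. The abstract does not state the exact reward used by \texttt{RTO}; the function names `token_rewards` and `sentence_reward`, the parameter `beta`, and the example log-probabilities are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def token_rewards(
    logp_dpo: List[float],
    logp_ref: List[float],
    beta: float = 0.1,
) -> List[float]:
    """Per-token reward beta * (log pi_dpo - log pi_ref).

    Assumption: this is the token-level split of the standard DPO
    implicit reward, not necessarily the exact reward used by RTO.
    Inputs are per-token log-probabilities of the same response under
    the DPO-trained policy and the reference policy.
    """
    if len(logp_dpo) != len(logp_ref):
        raise ValueError("log-probability lists must have the same length")
    return [beta * (p - q) for p, q in zip(logp_dpo, logp_ref)]


def sentence_reward(
    logp_dpo: List[float],
    logp_ref: List[float],
    beta: float = 0.1,
) -> float:
    """Sentence-level reward recovered by summing the token rewards."""
    return sum(token_rewards(logp_dpo, logp_ref, beta))


if __name__ == "__main__":
    # Hypothetical log-probabilities for a two-token response.
    dpo = [-1.0, -2.0]
    ref = [-1.5, -1.0]
    print(token_rewards(dpo, ref, beta=0.5))
    print(sentence_reward(dpo, ref, beta=0.5))
```

Under this assumption, a PPO stage would receive a dense reward at every token instead of a single sparse reward at the end of the response, while the sum of the token rewards still matches the sentence-level quantity.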
