Abstract
We study Reinforcement Learning from Human Feedback (RLHF) under a general preference oracle. In particular, we do not assume that there exists a reward function and that the preference signal is drawn from the Bradley-Terry model, as most prior works do. We consider a standard mathematical formulation, the reverse-KL regularized minimax game between two LLMs, for RLHF under a general preference oracle. The learning objective of this formulation is to find a policy that is consistently preferred by the KL-regularized preference oracle over any competing LLM. We show that this framework is strictly more general than the reward-based one, and propose sample-efficient algorithms for both offline learning from a pre-collected preference dataset and online learning, where we can query the preference oracle during training. Empirical studies verify the effectiveness of the proposed framework.