RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Abstract

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.
