A Survey of Reinforcement Learning from Human Feedback

Abstract

Reinforcement learning from human feedback (RLHF) is a variant of reinforcement learning (RL) that learns from human feedback instead of relying on an engineered reward function. Building on prior work on the related setting of preference-based reinforcement learning (PbRL), it stands at the intersection of artificial intelligence and human-computer interaction. This positioning offers a promising avenue to enhance the performance and adaptability of intelligent systems while also improving the alignment of their objectives with human values. The training of large language models (LLMs) has impressively demonstrated this potential in recent years, where RLHF played a decisive role in directing the model's capabilities toward human objectives. This article provides a comprehensive overview of the fundamentals of RLHF, exploring the intricate dynamics between RL agents and human input. While recent focus has been on RLHF for LLMs, our survey adopts a broader perspective, examining the diverse applications and wide-ranging impact of the technique. We delve into the core principles that underpin RLHF, shedding light on the symbiotic relationship between algorithms and human feedback, and discuss the main research trends in the field. By synthesizing the current landscape of RLHF research, this article aims to provide researchers as well as practitioners with a comprehensive understanding of this rapidly growing field of research.
