Weak-to-Strong Extrapolation Expedites Alignment

Abstract

Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand, can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be interpolated between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, thereby directly obtaining this stronger model by extrapolating from the weights of the former two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully-trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
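
The interpolation assumption in the abstract can be made concrete with a small sketch. If the medium model's weights satisfy medium = (1 - lam) * weak + lam * strong for some 0 < lam < 1, then solving for the stronger model gives strong = medium + alpha * (medium - weak) with alpha = 1/lam - 1. The Python below illustrates that arithmetic on toy checkpoints stored as plain dictionaries of float lists. The function name `extrapolate_weights`, the symbols `lam` and `alpha`, and the toy parameter names are assumptions made for this example and are not taken from the paper. For a real model `lam` is unknown, so `alpha` would presumably have to be treated as a tunable value.

```python
def extrapolate_weights(weak, medium, alpha):
    """Return medium + alpha * (medium - weak), parameter by parameter.

    `weak` and `medium` are hypothetical checkpoints: dicts mapping a
    parameter name to a flat list of floats. `alpha` is an assumed
    extrapolation coefficient (alpha = 0 returns the medium model).
    """
    if weak.keys() != medium.keys():
        raise ValueError("checkpoints must share the same parameter names")
    result = {}
    for name, weak_values in weak.items():
        medium_values = medium[name]
        if len(weak_values) != len(medium_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Step further along the direction from the weak to the medium model.
        result[name] = [
            m + alpha * (m - w) for w, m in zip(weak_values, medium_values)
        ]
    return result


# Toy check: build a medium model by interpolation, then recover the strong one.
weak = {"layer.weight": [0.0, 1.0], "layer.bias": [0.5]}
strong = {"layer.weight": [1.0, 3.0], "layer.bias": [1.5]}
lam = 0.5  # assumed interpolation coefficient for the toy example
medium = {
    name: [(1 - lam) * w + lam * s for w, s in zip(weak[name], strong[name])]
    for name in weak
}
alpha = 1 / lam - 1
recovered = extrapolate_weights(weak, medium, alpha)
```

In this toy setting `recovered` matches `strong` exactly because the medium model was constructed by interpolation. Whether real checkpoints satisfy that assumption closely enough is the empirical question the paper addresses.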
