RLRF: Reinforcement Learning from Reflection through Debates as Feedback for Bias Mitigation in LLMs

Abstract

Biases and stereotypes in Large Language Models (LLMs) can have negative implications for user experience and societal outcomes. Current approaches to bias mitigation, such as Reinforcement Learning from Human Feedback (RLHF), rely on costly manual feedback. While LLMs can understand logic and identify biases in text, they often struggle to acknowledge and address their own biases due to factors such as prompt influences, internal mechanisms, and policies. We found that informing LLMs that the content they generate is not their own, and questioning them about potential biases in the text, can significantly enhance their ability to recognize and reduce biases. Based on this finding, we propose RLRF (Reinforcement Learning from Reflection through Debates as Feedback), which replaces human feedback with AI feedback for bias mitigation. RLRF engages LLMs in multi-role debates to expose biases and gradually reduces them in each iteration using a ranking-based scoring mechanism. The dialogues are then used to create a dataset of high-bias and low-bias instances to train the reward model in reinforcement learning. This dataset can be generated by the same LLM for self-reflection, or by a superior LLM that guides it in a student-teacher mode to enhance its logical reasoning abilities. Experimental results demonstrate the significant effectiveness of our approach in bias reduction.
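
As a rough illustration of how ranked debate outputs might be turned into high-bias and low-bias training instances for a reward model, the sketch below pairs every two scored responses and marks the lower-bias one as preferred. It is not the authors' implementation: the `DebateTurn` type, the `bias_score` field (assumed here to be higher for more biased text), the `min_gap` threshold, and the `chosen`/`rejected` keys are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class DebateTurn:
    """One response from a debate round (hypothetical structure)."""
    text: str
    bias_score: float  # assumption: higher means more biased


def build_preference_pairs(turns, min_gap=0.0):
    """Pair responses so the lower-bias text is 'chosen' and the higher-bias text is 'rejected'.

    Pairs whose scores differ by no more than `min_gap` are skipped,
    since they carry little ranking signal.
    """
    pairs = []
    for a, b in combinations(turns, 2):
        if abs(a.bias_score - b.bias_score) <= min_gap:
            continue
        low, high = (a, b) if a.bias_score < b.bias_score else (b, a)
        pairs.append({"chosen": low.text, "rejected": high.text})
    return pairs


# Illustrative, made-up scores for three debate responses.
turns = [
    DebateTurn("response A", 0.9),
    DebateTurn("response B", 0.5),
    DebateTurn("response C", 0.1),
]
pairs = build_preference_pairs(turns)
```

Pairs in this form could then feed a pairwise reward-model objective; raising `min_gap` would keep only comparisons where the ranking is unambiguous.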
