MM-PhyRLHF: Reinforcement Learning Framework for Multimodal Physics Question-Answering

Abstract

Recent advancements in LLMs have shown significant potential in tasks such as text summarization and generation. Yet they often struggle with complex physics problems that require arithmetic calculation and a good understanding of concepts. Moreover, many physics problems include images that contain details essential to understanding the problem's context. We propose an LMM-based chatbot to answer multimodal physics MCQs. For domain adaptation, we use the MM-PhyQA dataset, which comprises Indian high school-level multimodal physics problems. To improve the LMM's performance, we experiment with two techniques: Reinforcement Learning from Human Feedback (RLHF) and image captioning. For image captioning, we add a detailed explanation of the diagram in each image to minimize hallucinations and image processing errors. We further explore an RLHF methodology, inspired by the ranking approach in RLHF, to enhance the human-like problem-solving abilities of the models. RLHF incorporates human feedback into the learning process of LLMs, improving the model's problem-solving skills, truthfulness, and reasoning capabilities, reducing hallucinations in the answers, and improving answer quality relative to vanilla supervised fine-tuned models. We employ the open-source LLaVA model to answer multimodal physics MCQs and compare its performance with and without RLHF.
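
The ranking approach mentioned above is commonly realized in RLHF pipelines as a pairwise loss over human-ranked responses: every pair drawn from a ranked list contributes -log(sigmoid(s_preferred - s_other)), where s is a scalar reward score. The sketch below illustrates that general formulation only; the abstract does not specify the authors' implementation, and the function name and the example scores are hypothetical.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores_best_to_worst):
    """Mean pairwise ranking loss for one question.

    `scores_best_to_worst` holds hypothetical reward-model scores for
    candidate answers, ordered from the human-preferred answer to the
    least preferred one.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked (preferred) answer.
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")

    total = 0.0
    for preferred, other in pairs:
        x = other - preferred
        # -log(sigmoid(preferred - other)) == log(1 + exp(x)),
        # computed in a numerically stable form.
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(pairs)


# Scores that agree with the human ranking give a lower loss than
# scores that contradict it.
agreeing = pairwise_ranking_loss([2.0, 1.0, 0.0])
contradicting = pairwise_ranking_loss([0.0, 1.0, 2.0])
```

Minimizing this quantity pushes the reward model to score preferred answers above less preferred ones; the resulting reward signal is what a policy model is then optimized against.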
