Large Language Models as Generalizable Policies for Embodied Tasks

Abstract

We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text instructions and visual egocentric observations and output actions directly in the environment. Using reinforcement learning, we train LLaRP to see and act solely through environmental interactions. We show that LLaRP is robust to complex paraphrasings of task instructions and can generalize to new tasks that require novel optimal behavior. In particular, on 1,000 unseen tasks it achieves 42% success rate, 1.7x the success rate of other common learned baselines or zero-shot applications of LLMs. Finally, to aid the community in studying language conditioned, massively multi-task, embodied AI problems we release a novel benchmark, Language Rearrangement, consisting of 150,000 training and 1,000 testing tasks for language-conditioned rearrangement. Video examples of LLaRP in unseen Language Rearrangement instructions are at https://llm-rl.github.io.
