COMBO: Compositional World Models for Embodied Multi-Agent Cooperation

Abstract

In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only partial egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents' actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video. By leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we can use a tree search procedure to integrate these modules and facilitate online cooperative planning. To evaluate the efficacy of our methods, we create two challenging embodied multi-agent long-horizon cooperation tasks using the ThreeDWorld simulator and conduct experiments with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed framework. More videos can be found at https://vis-www.cs.umass.edu/combo/.
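
As a rough illustration of what "factorizing the naturally composable joint actions" could look like, the sketch below combines per-agent, action-conditioned predictions into one joint prediction by adding each agent's offset from an unconditional baseline. The abstract does not state the composition rule, so the additive form is an assumption, and every name here (`compose_joint_prediction`, `toy_unconditional`, `toy_per_agent`) is hypothetical rather than taken from the paper's code. The toy predictors stand in for learned generative models.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def compose_joint_prediction(
    state: Vector,
    actions: Sequence[int],
    unconditional: Callable[[Vector], Vector],
    per_agent: Callable[[Vector, int, int], Vector],
) -> Vector:
    """Combine single-agent predictions into a joint one (assumed additive rule).

    joint = base + sum_i (per_agent(state, i, a_i) - base)
    The loop runs over however many actions are given, so the same
    function handles any number of agents.
    """
    base = unconditional(state)
    joint = list(base)
    for agent_id, action in enumerate(actions):
        conditioned = per_agent(state, agent_id, action)
        for k in range(len(joint)):
            joint[k] += conditioned[k] - base[k]
    return joint


# Toy stand-ins for learned models (hypothetical, for illustration only).
def toy_unconditional(state: Vector) -> Vector:
    return [0.0 for _ in state]


def toy_per_agent(state: Vector, agent_id: int, action: int) -> Vector:
    # Each agent only affects its own coordinate of the prediction.
    out = [0.0 for _ in state]
    out[agent_id] = float(action)
    return out


print(compose_joint_prediction([0.0, 0.0, 0.0], [1, 2, 3],
                               toy_unconditional, toy_per_agent))
# [1.0, 2.0, 3.0]
```

Because each agent contributes an independent term, adding or removing an agent changes only the number of terms in the sum, which is one way a model trained on single-agent effects could be queried with an arbitrary number of agents.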
