N-Agent Ad Hoc Teamwork

Abstract

Current approaches to learning cooperative behaviors in multi-agent settings assume relatively restrictive settings. In standard fully cooperative multi-agent reinforcement learning, the learning algorithm controls all agents in the scenario, while in ad hoc teamwork, the learning algorithm usually assumes control over only a single agent in the scenario. However, many cooperative settings in the real world are much less restrictive. For example, in an autonomous driving scenario, a company might train its cars with the same learning algorithm, yet once on the road, these cars must cooperate with cars from another company. Towards generalizing the class of scenarios that cooperative learning methods can address, we introduce $N$-agent ad hoc teamwork (NAHT), in which a set of autonomous agents must interact and cooperate with dynamically varying numbers and types of teammates at evaluation time. This paper formalizes the problem and proposes the Policy Optimization with Agent Modelling (POAM) algorithm. POAM is a policy gradient, multi-agent reinforcement learning approach to the NAHT problem that enables adaptation to diverse teammate behaviors by learning representations of those behaviors. Empirical evaluation on StarCraft II tasks shows that POAM improves cooperative task returns compared to baseline approaches and enables out-of-distribution generalization to unseen teammates.
