Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level

Abstract

Prominent large language models have exhibited human-level performance in many domains, even enabling agents derived from them to simulate human and social interactions. While prior work has demonstrated the feasibility of grounding language agents in sandbox simulations or embodied simulators, current social intelligence benchmarks either stay at the language level or use subjective metrics. In pursuit of a more realistic and objective evaluation, we introduce the Social Tasks in Sandbox Simulation (STSS) benchmark, which assesses language agents objectively at the action level by scrutinizing goal achievement within a multi-agent simulation. Additionally, we sample conversation scenarios to build a language-level benchmark that provides an economical preliminary evaluation and aligns with prevailing benchmarks. To gauge the significance of agent architecture, we implement a target-driven planning (TDP) module as an adjunct to the existing agent. Our evaluation shows that the STSS benchmark is challenging for state-of-the-art language agents. Furthermore, it effectively discriminates between distinct language agents, suggesting its usefulness as a benchmark for evaluating both language models and agent architectures.
