Autonomous Evaluation and Refinement of Digital Agents

Abstract

We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4% and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve a 75% relative improvement in a challenging domain transfer scenario.
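To make the two kinds of reported numbers concrete, the sketch below shows one plausible way to compute an evaluator's agreement with oracle labels and a relative improvement in success rate. It is an illustration, not the paper's implementation: the function names `agreement_rate` and `relative_improvement`, the per-episode boolean success labels, and the toy values are all assumptions.

```python
from typing import Sequence


def agreement_rate(evaluator_labels: Sequence[bool],
                   oracle_labels: Sequence[bool]) -> float:
    """Fraction of episodes where the evaluator's verdict matches the oracle's.

    Assumes (hypothetically) one success/failure label per episode.
    """
    if len(evaluator_labels) != len(oracle_labels):
        raise ValueError("label sequences must have the same length")
    if not oracle_labels:
        raise ValueError("at least one episode is required")
    matches = sum(e == o for e, o in zip(evaluator_labels, oracle_labels))
    return matches / len(oracle_labels)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, e.g. 0.5 means +50%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Toy labels for illustration only; these are not results from the paper.
evaluator = [True, False, True, True]
oracle = [True, False, False, True]
rate = agreement_rate(evaluator, oracle)  # 3 of the 4 verdicts match
```

Under this reading, a relative improvement compares the gain to the baseline success rate rather than reporting an absolute difference in percentage points.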
