A Multimodal Automated Interpretability Agent

Abstract

This paper describes MAIA, a Multimodal Automated Interpretability Agent. MAIA is a system that uses neural models to automate neural model understanding tasks like feature interpretation and failure mode discovery. It equips a pre-trained vision-language model with a set of tools that support iterative experimentation on subcomponents of other models to explain their behavior. These include tools commonly used by human interpretability researchers: for synthesizing and editing inputs, computing maximally activating exemplars from real-world datasets, and summarizing and describing experimental results. Interpretability experiments proposed by MAIA compose these tools to describe and explain system behavior. We evaluate applications of MAIA to computer vision models. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. Across several trained models and a novel dataset of synthetic vision neurons with paired ground-truth descriptions, MAIA produces descriptions comparable to those generated by expert human experimenters. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be mis-classified.
