Abstract
Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages large language models (LLMs) to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at https://davidmchan.github.io/aloha/.
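The matching step described above can be illustrated with a minimal sketch. This is not the released ALOHa implementation. It assumes the object lists have already been extracted, so the LLM step is skipped. It replaces the semantic similarity model with a toy word-overlap score, and it finds the optimal one-to-one assignment by brute force, which gives the same result Hungarian matching would on small inputs. The names `toy_similarity` and `match_objects`, and the choice to take the minimum matched similarity as the caption-level score, are assumptions made for illustration.

```python
from itertools import permutations


def toy_similarity(a, b):
    """Jaccard overlap of lowercase word sets.

    Assumption: a stand-in for a real semantic similarity model.
    """
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def match_objects(candidates, references, sim=toy_similarity):
    """Match candidate objects one-to-one to reference objects.

    Returns (per_object, caption_score), where per_object is a list of
    (candidate, matched similarity) pairs and caption_score is the lowest
    matched similarity. A low score suggests a likely hallucination.
    """
    if not candidates:
        return [], 1.0

    # Pad references with None so every candidate can be assigned a slot;
    # a candidate matched to None has similarity 0.
    padded = list(references) + [None] * max(0, len(candidates) - len(references))

    def score(c, r):
        return 0.0 if r is None else sim(c, r)

    # Brute-force search over assignments (fine for a handful of objects;
    # Hungarian matching solves the same problem in polynomial time).
    best_total, best_assignment = -1.0, None
    for perm in permutations(range(len(padded)), len(candidates)):
        total = sum(score(c, padded[j]) for c, j in zip(candidates, perm))
        if total > best_total:
            best_total, best_assignment = total, perm

    per_object = [(c, score(c, padded[j])) for c, j in zip(candidates, best_assignment)]
    return per_object, min(s for _, s in per_object)


per_object, caption_score = match_objects(
    ["dog", "frisbee", "bench"],           # objects from the candidate caption
    ["brown dog", "frisbee", "grass"],     # objects from references and detections
)
print(per_object)      # [('dog', 0.5), ('frisbee', 1.0), ('bench', 0.0)]
print(caption_score)   # 0.0
```

In this example "bench" has no plausible counterpart among the reference objects, so it receives a matched similarity of 0 and pulls the caption-level score down with it.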