Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings

Abstract

While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
