TheaterGen: Character Management with LLM for Consistent Multi-turn Image Generation

  • 2024-04-29 18:58:14
  • Junhao Cheng, Baiqiao Yin, Kaixin Cai, Minbin Huang, Hanhui Li, Yuxin He, Xi Lu, Yue Li, Yifei Li, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang

Abstract

Recent advances in diffusion models make it possible to generate high-quality, visually striking images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to the "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, akin to conducting the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen significantly outperforms state-of-the-art methods. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
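
To make the "prompt book" idea concrete, here is a minimal sketch of what such a structure might look like when an LLM keeps it up to date across turns. The abstract does not specify the actual format, so everything here is an assumption for illustration only: the class names `CharacterEntry` and `PromptBook`, the `upsert` and `remove` methods, the character IDs, and the choice of a normalized `(x, y, width, height)` box as the layout.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical layout type: (x, y, width, height) as fractions of the canvas.
BBox = Tuple[float, float, float, float]


@dataclass
class CharacterEntry:
    """One character in the target image: its text prompt and its layout box."""
    prompt: str
    bbox: BBox


@dataclass
class PromptBook:
    """Hypothetical per-conversation state keyed by a stable character ID."""
    background: str = ""
    characters: Dict[str, CharacterEntry] = field(default_factory=dict)

    def upsert(self, char_id: str, prompt: str, bbox: BBox) -> None:
        # Add a new character, or overwrite the entry for an existing ID so
        # that later turns keep referring to the same subject.
        self.characters[char_id] = CharacterEntry(prompt, bbox)

    def remove(self, char_id: str) -> None:
        # Drop a character; a missing ID is ignored.
        self.characters.pop(char_id, None)


# Turn 1: the user asks for a dog and a cat in a park.
book = PromptBook(background="a sunny park")
book.upsert("dog_1", "a small white dog", (0.1, 0.5, 0.3, 0.4))
book.upsert("cat_1", "a black cat", (0.6, 0.5, 0.3, 0.4))

# Turn 2: the user edits the dog and removes the cat.
book.upsert("dog_1", "a small white dog wearing a red hat", (0.1, 0.5, 0.3, 0.4))
book.remove("cat_1")
```

Keying entries by a stable ID is one plausible way to keep the same subject addressable across turns; the framework itself may organize its prompt book differently.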

 
