Abstract
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive ID-preservation strategy that accounts for both intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions, and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through a facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.