Dense Text-to-Image Generation with Attention Modulation

Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

92 Scopus citations

Abstract

Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions in terms of both automatic and human evaluation scores. In addition, we achieve visual results of similar quality to models specifically trained with layout conditions. Code and data are available at https://github.com/naver-ai/DenseDiffusion.
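To illustrate the general idea of layout-guided attention modulation described above, a minimal sketch is given below. It biases cross-attention logits so that each text token attends more strongly to the image region it describes and less strongly elsewhere, then re-normalizes with softmax. All names (`modulate_cross_attention`, `region_masks`, `boost`) are hypothetical and this is not DenseDiffusion's exact formulation; the official repository linked in the abstract contains the real implementation.

```python
# Illustrative sketch of layout-guided cross-attention modulation (assumed API,
# not the paper's exact method). Attention scores are boosted inside each text
# token's target region and suppressed outside it before the softmax.
import torch
import torch.nn.functional as F


def modulate_cross_attention(q, k, v, region_masks, boost=2.0):
    """Bias cross-attention so each text token attends to its target region.

    q:            (num_pixels, d) image-query features
    k, v:         (num_tokens, d) text key/value features
    region_masks: (num_tokens, num_pixels) binary masks, 1 where the object
                  described by a token should appear
    """
    d = q.shape[-1]
    logits = q @ k.t() / d ** 0.5               # (num_pixels, num_tokens)
    # +boost inside a token's region, -boost outside it.
    bias = boost * (region_masks.t().float() * 2.0 - 1.0)
    attn = F.softmax(logits + bias, dim=-1)     # re-normalize after modulation
    return attn @ v                             # (num_pixels, d)


# Toy usage: a 16x16 latent, 4 text tokens, 64-dim features.
if __name__ == "__main__":
    pixels, tokens, dim = 16 * 16, 4, 64
    q = torch.randn(pixels, dim)
    k = torch.randn(tokens, dim)
    v = torch.randn(tokens, dim)
    masks = torch.zeros(tokens, pixels)
    masks[0, : pixels // 2] = 1                 # token 0 -> top half of the image
    out = modulate_cross_attention(q, k, v, masks)
    print(out.shape)                            # torch.Size([256, 64])
```

In this sketch the modulation is a fixed additive bias; the paper's method additionally conditions the modulation on factors such as the denoising timestep and region size, so the snippet should be read only as a conceptual starting point.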

Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7667-7677
Number of pages: 11
ISBN (Electronic): 9798350307184
DOIs
State: Published - 2023
Event: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: 2 Oct 2023 – 6 Oct 2023

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499

Conference

Conference: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory: France
City: Paris
Period: 2/10/23 – 6/10/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.
