Abstract
Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions in terms of both automatic and human evaluation scores. In addition, we achieve visual results of similar quality to those of models trained specifically with layout conditions. Code and data are available at https://github.com/naver-ai/DenseDiffusion.
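The attention modulation described in the abstract can be pictured with a short sketch. The snippet below is an illustrative approximation, not the paper's released implementation: it assumes pre-softmax cross-attention logits of shape (batch×heads, image tokens, text tokens), one flattened binary mask per segment, and the text-token indices of each segment's caption; the function name `modulate_attention_scores` and the `strength` parameter are hypothetical placeholders.

```python
import torch

def modulate_attention_scores(scores, region_masks, token_index_groups, strength=1.0):
    """Layout-aware biasing of cross-attention logits (illustrative sketch only).

    scores:             (batch*heads, N_img, N_txt) pre-softmax cross-attention logits
    region_masks:       list of flattened binary masks, each of shape (N_img,)
    token_index_groups: list of text-token index tensors, one per segment caption
    """
    modulated = scores.clone()
    n_txt = scores.shape[-1]
    for mask, token_ids in zip(region_masks, token_index_groups):
        img_in = mask.bool().view(1, -1, 1)                     # (1, N_img, 1)
        txt_sel = torch.zeros(n_txt, dtype=torch.bool, device=scores.device)
        txt_sel[token_ids] = True
        txt_sel = txt_sel.view(1, 1, -1)                        # (1, 1, N_txt)
        # Boost logits where in-region image tokens attend to this segment's text tokens,
        # and suppress logits where out-of-region image tokens attend to them.
        modulated = modulated + strength * (img_in & txt_sel).float()
        modulated = modulated - strength * (~img_in & txt_sel).float()
    return modulated
```

The actual DenseDiffusion method applies its modulation inside the attention layers of the denoising network and scales it with additional factors described in the paper and repository; this sketch only conveys the basic region-to-token biasing idea.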
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 7667-7677 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798350307184 |
| DOIs | |
| State | Published - 2023 |
| Event | 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France |
| Duration | 2 Oct 2023 → 6 Oct 2023 |
Publication series
| Name | Proceedings of the IEEE International Conference on Computer Vision |
|---|---|
| ISSN (Print) | 1550-5499 |
Conference
| Conference | 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 |
|---|---|
| Country/Territory | France |
| City | Paris |
| Period | 2/10/23 → 6/10/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.