Read, Watch and Scream! Sound Generation from Text and Video

Research output: Contribution to journal › Conference article › peer-review

6 Scopus citations

Abstract

Despite the impressive progress of multimodal generative models, video-to-audio generation still suffers from limited performance and offers little flexibility to prioritize sound synthesis for specific objects within the scene. Conversely, text-to-audio generation methods produce high-quality audio but struggle to ensure comprehensive scene depiction and time-varying control. To tackle these challenges, we propose a novel video-and-text-to-audio generation method, called ReWaS, in which video serves as a conditional control for a text-to-audio generation model. Specifically, our method estimates the structural information of sound (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-audio model to consolidate the video control, which is far more efficient than training multimodal diffusion models on massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, our system becomes more flexible, allowing users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method is superior in terms of quality, controllability, and training efficiency.
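The abstract describes predicting a frame-level energy curve from video and using it as a temporal control signal for a pretrained text-to-audio model. Below is a minimal, hypothetical sketch of that interface in PyTorch; the module names, feature dimensions, and the definition of "energy" as the mean mel magnitude per frame are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a video-feature-to-energy regressor
# whose output could serve as a temporal control for a pretrained
# text-to-audio model. All names and shapes are hypothetical.
import torch
import torch.nn as nn


class VideoToEnergy(nn.Module):
    """Hypothetical regressor: per-frame video features -> per-frame sound energy."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T, feat_dim) -> energy curve: (B, T)
        return self.net(video_feats).squeeze(-1)


def energy_from_mel(mel: torch.Tensor) -> torch.Tensor:
    """One possible training target: mean mel magnitude per frame (assumed definition)."""
    # mel: (B, T, n_mels) -> (B, T)
    return mel.mean(dim=-1)


# Usage (shapes only; the text-to-audio backbone is a stand-in):
video_feats = torch.randn(2, 100, 512)      # e.g. per-frame visual features
energy = VideoToEnergy()(video_feats)       # temporal control signal, (2, 100)
# text_to_audio(prompt="dog barking", control=energy)  # hypothetical conditioned call
```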

Original language: English
Pages (from-to): 17590-17598
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 39
Issue number: 17
DOIs
State: Published - 11 Apr 2025
Event: 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 - Philadelphia, United States
Duration: 25 Feb 2025 - 4 Mar 2025

Bibliographical note

Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence.
