General Audio Synthesis

Main Contributors: Junwon Lee, Yoonjin Chung, Jaekwon Im

Foley Sound Generation

Foley sound is the reproduction of everyday sound effects that are added to films, videos, and other media in post-production to enhance audio quality. Named after Jack Foley, a pioneering sound-effects artist, these sounds include common noises such as footsteps, the rustle of clothing, creaking doors, and breaking glass. Foley artists produce these sounds by hand and synchronize them with the visual elements, recreating specific sounds to augment or replace the original recordings. The work takes place on dedicated Foley stages or in studios equipped with props and surfaces chosen to achieve the desired acoustics (see the figure below [2][3][4]).

Foley sound effects play a crucial role in enhancing the immersive experience of various media forms, such as movies, games, and virtual reality. These effects add depth and authenticity, creating a more engaging and captivating auditory experience for the audience.

Creating Foley sounds is intricate and time-consuming, because it involves capturing numerous subtle effects within a video sequence. Automated Foley sound synthesis aims to generate audio that:

Represents specific sound sources (categories) with the desired nuance and intensity.
Aligns temporally with the events or patterns in the visual sequence.
Maintains high overall quality.

For example, in a given video, the system should be able to differentiate between the sounds of a whining Chihuahua and a barking Retriever, identify whether a raindrop hits a glass window or a wooden surface, and vary the timing and volume of footsteps.

Our project centers on Controllable Foley Sound Synthesis, aimed at making production easy and intuitive. The objective is to let users control the sound source(s) and temporal events during the neural generation process. T-Foley is our first step in this direction: a framework for controllable generation with intuitive conditioning inputs, namely the desired sound class and temporal event features (root-mean-square energy, RMS). Please refer to our demo webpage.
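To make the temporal event feature concrete, the following is a minimal sketch of how a frame-wise RMS envelope could be computed from a mono waveform. It is an illustration only, not the T-Foley implementation: the function name frame_rms, the frame length, the hop length, and the centered zero-padding are all assumptions chosen for the example.

```python
import numpy as np


def frame_rms(waveform, frame_length=512, hop_length=128):
    """Return the frame-wise RMS envelope of a mono waveform.

    frame_length and hop_length are illustrative defaults (assumptions),
    not values taken from the T-Foley paper.
    """
    x = np.asarray(waveform, dtype=np.float64)
    # Zero-pad both ends so that frame i is centered on sample i * hop_length.
    pad = frame_length // 2
    x = np.pad(x, (pad, pad))
    n_frames = 1 + (len(x) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = x[start:start + frame_length]
        # RMS = square root of the mean of the squared samples in the frame.
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms


# Usage: a short noise burst in the middle of one second of silence (16 kHz).
rng = np.random.default_rng(0)
signal = np.zeros(16000)
signal[6000:8000] = rng.uniform(-1.0, 1.0, size=2000)
envelope = frame_rms(signal)
# The envelope is zero in silent regions and rises where the burst occurs.
```

An envelope like this describes when events happen and how loud they are without fixing what the sound is, which is why it pairs naturally with a separate sound-class condition.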

Related Publications

T-Foley: A Controllable Waveform-domain Diffusion Model for Temporal-event-guided Foley Sound Synthesis
Yoonjin Chung*, Junwon Lee*, and Juhan Nam
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
[paper] [code] [demo]

Organizing DCASE Challenge

We contribute to practical research directions and to the development of suitable evaluation methodologies by helping to organize the DCASE Challenge. Our initial focus was the "Foley Sound Synthesis" task, which aimed to generate natural, high-quality audio for specific Foley sound categories. Our efforts have since shifted to "Sound Scene Synthesis," which focuses on integrating general sound scenes for standard sound post-production. For more information, please visit the official website.

DCASE 2023 Challenge Task 7: Foley Sound Synthesis
DCASE 2024 Challenge Task 7: Sound Scene Synthesis

Related Publications

Foley sound synthesis at the DCASE 2023 challenge
Keunwoo Choi*, Jaekwon Im*, Laurie Heller*, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinnosuke Takamichi
Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023
[paper]

References

[1] Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinnosuke Takamichi, "Foley sound synthesis at the DCASE 2023 challenge," Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2023.
[2] Sanchita Ghose and John Jeffrey Prevost, "AutoFoley: Artificial synthesis of synchronized sound tracks for silent videos with deep learning," IEEE Transactions on Multimedia, vol. 23, pp. 1895-1907, 2020.
[3] https://theartcareerproject.com/careers/foley-art/
[4] https://beatproduction.net/foley-sfx-pack/