ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Anonymous Authors

Abstract

Recent advances in text-guided audio generation have yielded promising results in diverse domains, including sound effects, environmental audio, speech, and music. However, jointly generating speech with environmental audio remains challenging because the two differ in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an Environment-Aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latents with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to Environment-Aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.
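As a rough illustration of the joint attention mentioned in the abstract, the sketch below shows one common way a dual-stream transformer can fuse two token sequences: each stream keeps its own query/key/value projections, but attention runs over the concatenated sequence so that speech tokens can attend to environment tokens and vice versa. This is a minimal single-head NumPy sketch, not the authors' implementation; the function name `joint_attention`, the weight layout, and all dimensions are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(speech, env, w_speech, w_env):
    """Single-head joint attention over two token streams (illustrative).

    speech: (T_s, d) speech-stream tokens
    env:    (T_e, d) environment-stream tokens
    w_speech, w_env: dicts holding 'q', 'k', 'v' matrices of shape (d, d),
        one set per stream (hypothetical parameter layout).
    Returns the updated speech tokens, the updated environment tokens,
    and the (T_s + T_e, T_s + T_e) attention matrix.
    """
    # Stream-specific projections, concatenated along the time axis.
    q = np.concatenate([speech @ w_speech["q"], env @ w_env["q"]], axis=0)
    k = np.concatenate([speech @ w_speech["k"], env @ w_env["k"]], axis=0)
    v = np.concatenate([speech @ w_speech["v"], env @ w_env["v"]], axis=0)

    # Scaled dot-product attention over the joint sequence.
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    out = attn @ v

    # Split the result back into the two streams.
    t_s = speech.shape[0]
    return out[:t_s], out[t_s:], attn


# Toy usage with random inputs (all sizes are arbitrary).
rng = np.random.default_rng(0)
d = 8
speech = rng.normal(size=(5, d))
env = rng.normal(size=(3, d))


def make_weights():
    return {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in "qkv"}


s_out, e_out, attn = joint_attention(speech, env, make_weights(), make_weights())
```

Because the attention matrix spans both streams, every output token is a mixture of speech and environment values, which is the property that lets the two modalities condition each other.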

Model Framework Overview

Figure 1: Overview of ImmersiveTTS. A dual-stream MM-DiT backbone conditions the speech stream on content prompt-aligned linguistic features, while Flan-T5 token embeddings drive the environmental context stream and CLAP embeddings modulate AdaLN for global conditioning. The model is trained with flow matching and domain-specific REPA objectives.
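To make the training objective named in the caption more concrete, here is a minimal NumPy sketch of how a flow matching loss and a representation alignment (REPA) term are typically combined. It assumes a linear interpolation path between noise and data, a negative cosine similarity alignment loss, and a weighting factor `lam`; none of these specifics are stated on this page, so treat them as illustrative assumptions. The names `flow_matching_loss`, `repa_loss`, and `total_loss` are hypothetical.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, rng):
    """Flow matching loss on a linear path (assumed path choice).

    velocity_fn: callable (x_t, t) -> predicted velocity, same shape as x_t
    x1: (B, T, D) batch of clean latents
    """
    x0 = rng.normal(size=x1.shape)              # noise sample
    t = rng.uniform(size=(x1.shape[0], 1, 1))   # one time step per example
    x_t = (1.0 - t) * x0 + t * x1               # point on the straight path
    target = x1 - x0                            # velocity of that path
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


def repa_loss(hidden, target, proj):
    """Negative mean cosine similarity between projected hidden states
    and frozen encoder features (assumed form of the alignment term).

    hidden: (B, T, D_h) intermediate transformer states
    target: (B, T, D_e) features from a frozen self-supervised encoder
    proj:   (D_h, D_e) projection matrix
    """
    z = hidden @ proj
    z = z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)
    y = target / (np.linalg.norm(target, axis=-1, keepdims=True) + 1e-8)
    return float(-np.mean(np.sum(z * y, axis=-1)))


def total_loss(fm, repa_terms, lam=0.5):
    """Combine the flow matching loss with one REPA term per encoder.
    lam is a hypothetical weight, not a value taken from the paper."""
    return fm + lam * sum(repa_terms)


# Toy usage: a dummy model that always predicts zero velocity.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(2, 4, 6))
fm = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), x1, rng)

hidden = rng.normal(size=(2, 4, 6))
speech_feats = rng.normal(size=(2, 4, 6))   # stand-in for a speech encoder
audio_feats = rng.normal(size=(2, 4, 6))    # stand-in for an audio encoder
identity = np.eye(6)
loss = total_loss(
    fm,
    [repa_loss(hidden, speech_feats, identity),
     repa_loss(hidden, audio_feats, identity)],
)
```

Using one alignment term per frozen encoder mirrors the encoder combinations compared in the REPA Analysis samples further down the page, though which hidden states are aligned to which encoder is not specified here.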

Tasks

Examples

Environment-Aware Text-to-Speech

Sample 1

Transcription: "Pretty smart she probably she snaps at them."

Environment Description: "Birds chirping with leaves rustling and wind"

Ground Truth

Reconstructed

VoiceLDM

VoiceDiT

ImmersiveTTS (Ours)

Sample 2

Transcription: "Okay, let's go look at the paper."

Environment Description: "Rapid and repeated gunfire"

Ground Truth

Reconstructed

VoiceLDM

VoiceDiT

ImmersiveTTS (Ours)

Sample 3

Transcription: "Second gear the car doesn't spool up too quick anymore because of the turbo and"

Environment Description: "A vehicle accelerating"

Ground Truth

Reconstructed

VoiceLDM

VoiceDiT

ImmersiveTTS (Ours)

Sample 4

Transcription: "Just like that and get a glass out"

Environment Description: "Dishes clanking and doors banging"

Ground Truth

Reconstructed

VoiceLDM

VoiceDiT

ImmersiveTTS (Ours)

Text-to-Speech

Sample 1

Transcription: "The boy from the house is coming up for the rector."

Ground Truth

Reconstructed

VoiceLDM

VoiceDiT

ImmersiveTTS (Ours)

Sample 2

Transcription: "You represented a divinity, beautiful, disdainful, inconstant."

Ground Truth

Reconstructed

VoiceLDM

VoiceDiT

ImmersiveTTS (Ours)

Sample 3

Transcription: "'Half past ten,' said the captain, looking at his watch."

Ground Truth

Reconstructed

VoiceLDM

VoiceDiT

ImmersiveTTS (Ours)

REPA Analysis

Sample 1

Transcription: "Your competition chatter is more of a rolling type of chatter."

Environment Description: "A duck quacks while birds chirp in the distance."

Ground Truth

Base (w/o REPA)

w/ WavLM

w/ ATST-Frame

w/ USAD

w/ WavLM+USAD

w/ USAD+ATST-Frame

w/ WavLM+ATST-Frame

Sample 2

Transcription: "Little bit of bad weather."

Environment Description: "Rain falling"

Ground Truth

Base (w/o REPA)

w/ WavLM

w/ ATST-Frame

w/ USAD

w/ WavLM+USAD

w/ USAD+ATST-Frame

w/ WavLM+ATST-Frame

Sample 3

Transcription: "Nice boat."

Environment Description: "A motorboat engine running followed by a plastic clank while wind blows into a microphone."

Ground Truth

Base (w/o REPA)

w/ WavLM

w/ ATST-Frame

w/ USAD

w/ WavLM+USAD

w/ USAD+ATST-Frame

w/ WavLM+ATST-Frame