FLowHigh Demo

Abstract

Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which significantly increases latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL, while maintaining computational efficiency with only a single-step sampling process.
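For readers unfamiliar with the log-spectral distance (LSD) mentioned above, the sketch below shows one common formulation: the root-mean-square difference of log power spectra over frequency bins, averaged over frames. It is an illustrative sketch only. The function name, the `eps` floor, and the nested-list spectrogram layout are assumptions made here, and the STFT settings used for the reported results are not stated on this page.

```python
import math
from typing import Sequence


def log_spectral_distance(
    ref_mag: Sequence[Sequence[float]],
    est_mag: Sequence[Sequence[float]],
    eps: float = 1e-8,  # assumed floor to avoid log10(0); not taken from the paper
) -> float:
    """LSD between two magnitude spectrograms laid out as [frame][frequency_bin]."""
    if len(ref_mag) != len(est_mag) or not ref_mag:
        raise ValueError("spectrograms must be non-empty and have the same number of frames")

    per_frame = []
    for ref_frame, est_frame in zip(ref_mag, est_mag):
        if len(ref_frame) != len(est_frame) or not ref_frame:
            raise ValueError("frames must be non-empty and have the same number of bins")
        # Squared difference of log power spectra for each frequency bin.
        sq_diff = [
            (math.log10(r * r + eps) - math.log10(e * e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # Root of the mean over frequency bins gives the distance for this frame.
        per_frame.append(math.sqrt(sum(sq_diff) / len(sq_diff)))

    # Average the per-frame distances over time.
    return sum(per_frame) / len(per_frame)
```

Lower values indicate that the estimated spectrum is closer to the reference; identical spectrograms give a distance of zero.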

Official implementation of FLowHigh: https://github.com/jjunak-yun/FLowHigh_code

FLowHigh was trained and evaluated on the VCTK dataset.

We compared FLowHigh with the following audio super-resolution models:

1. NU-Wave2 [Code][Demo page]

2. UDM+ [Code][Demo page]

3. mdctGAN [Code]

4. Fre-Painter [Code][Demo page]

Audible Frequency

We recommend conducting a brief frequency test to ensure precise listening.

WARNING: LOUD HIGH-FREQUENCY SOUNDS MAY RESULT IN UNCOMFORTABLE OR PAINFUL SENSATIONS. (For a smoother testing experience, please select the frequency by sliding the slider bar rather than clicking.)

FLowHigh

For a fair comparison, we used the official implementations and provided checkpoints of the comparison models. The demo samples were randomly selected from the test set. The diffusion-based models, NU-Wave2 and UDM+, generated audio samples using 50 sampling steps to achieve adequate sound quality. You can listen to audio samples at a sampling rate of 48 kHz, generated from input sampling rates ranging from 8 kHz to 24 kHz. To make efficiency easy to verify, we also indicate the number of function evaluations (NFE) for our model.
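The NFE labels on the FLowHigh samples follow from the ODE solver used for sampling: a single Euler step evaluates the vector field once, while a single midpoint step evaluates it twice. The sketch below illustrates this counting on a toy problem. It is not the FLowHigh implementation: `toy_field`, `TARGET`, and `CountingField` are hypothetical names introduced here, and `toy_field` merely stands in for the trained network by pointing along a straight line toward a fixed target.

```python
from typing import Callable, List

Vector = List[float]
VectorField = Callable[[Vector, float], Vector]

# Hypothetical "high-resolution" target used only for this toy example.
TARGET: Vector = [2.0, -1.0]


def toy_field(x: Vector, t: float) -> Vector:
    """Stand-in for a learned vector field: velocity of a straight path to TARGET."""
    return [(target_i - x_i) / (1.0 - t) for x_i, target_i in zip(x, TARGET)]


class CountingField:
    """Wraps a vector field and counts its evaluations (the NFE)."""

    def __init__(self, field: VectorField) -> None:
        self.field = field
        self.nfe = 0

    def __call__(self, x: Vector, t: float) -> Vector:
        self.nfe += 1
        return self.field(x, t)


def euler_sample(v: VectorField, x0: Vector, steps: int = 1) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps (1 evaluation each)."""
    x, h = list(x0), 1.0 / steps
    for i in range(steps):
        k = v(x, i * h)
        x = [x_i + h * k_i for x_i, k_i in zip(x, k)]
    return x


def midpoint_sample(v: VectorField, x0: Vector, steps: int = 1) -> Vector:
    """Integrate from t=0 to t=1 with midpoint steps (2 evaluations each)."""
    x, h = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * h
        k1 = v(x, t)  # slope at the start of the step
        x_mid = [x_i + 0.5 * h * k_i for x_i, k_i in zip(x, k1)]
        k2 = v(x_mid, t + 0.5 * h)  # slope at the midpoint
        x = [x_i + h * k_i for x_i, k_i in zip(x, k2)]
    return x


if __name__ == "__main__":
    start = [0.0, 1.0]
    for name, sampler in [("Euler", euler_sample), ("Midpoint", midpoint_sample)]:
        field = CountingField(toy_field)
        result = sampler(field, start, steps=1)
        print(f"{name}: NFE={field.nfe}, result={result}")
```

Because the toy path is a straight line, both solvers reach the target in one step here; with a real learned field, the extra midpoint evaluation trades one more network call for a more accurate step.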

Audio: p364_055 Sampling rate
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2	UDM+	mdctGAN
Fre-Painter	FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)

Audio: p361_281 Sampling rate
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2	UDM+	mdctGAN
Fre-Painter	FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)

Audio: p360_253 Sampling rate
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2	UDM+	mdctGAN
Fre-Painter	FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)

Audio: p376_159 Sampling rate
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2	UDM+	mdctGAN
Fre-Painter	FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)

Sampling Steps

Audio: p362_115 (8 kHz to 48 kHz)
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2 (NFE=1)	NU-Wave2 (NFE=10)	NU-Wave2 (NFE=25)	NU-Wave2 (NFE=50)	NU-Wave2 (NFE=100)
UDM+ (NFE=1)	UDM+ (NFE=10)	UDM+ (NFE=25)	UDM+ (NFE=50)	UDM+ (NFE=100)
FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)

Audio: p363_188 (12 kHz to 48 kHz)
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2 (NFE=1)	NU-Wave2 (NFE=10)	NU-Wave2 (NFE=25)	NU-Wave2 (NFE=50)	NU-Wave2 (NFE=100)
UDM+ (NFE=1)	UDM+ (NFE=10)	UDM+ (NFE=25)	UDM+ (NFE=50)	UDM+ (NFE=100)
FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)

Audio: p374_336 (16 kHz to 48 kHz)
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2 (NFE=1)	NU-Wave2 (NFE=10)	NU-Wave2 (NFE=25)	NU-Wave2 (NFE=50)	NU-Wave2 (NFE=100)
UDM+ (NFE=1)	UDM+ (NFE=10)	UDM+ (NFE=25)	UDM+ (NFE=50)	UDM+ (NFE=100)
FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)

Audio: p360_072 (24 kHz to 48 kHz)
GT	Input	GT Reconstruction (w/ Post-processing)
NU-Wave2 (NFE=1)	NU-Wave2 (NFE=10)	NU-Wave2 (NFE=25)	NU-Wave2 (NFE=50)	NU-Wave2 (NFE=100)
UDM+ (NFE=1)	UDM+ (NFE=10)	UDM+ (NFE=25)	UDM+ (NFE=50)	UDM+ (NFE=100)
FLowHigh (Euler, NFE=1)	FLowHigh (Midpoint, NFE=2)