FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching

 

Jun-Hak Yun, Seung-Bin Kim, and Seong-Whan Lee

Abstract

Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL, while maintaining computational efficiency with only a single-step sampling process.
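For readers unfamiliar with the log-spectral distance (LSD) named in the abstract, the following is a minimal sketch of one common definition: the root-mean-square difference of log power spectra over frequency, averaged over frames. The STFT settings (`n_fft`, `hop`), the `eps` floor, and both helper names are assumptions chosen for illustration; this is not the evaluation code used for FLowHigh.

```python
import numpy as np

def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram from Hann-windowed frames (illustrative settings)."""
    x = np.asarray(x, dtype=float)
    if len(x) < n_fft:
        raise ValueError("signal must contain at least n_fft samples")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Shape: (frames, frequency bins)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_spectral_distance(reference, estimate, eps=1e-10):
    """Mean over frames of the RMS log10-power difference over frequency."""
    p_ref = stft_power(reference)
    p_est = stft_power(estimate)
    diff = np.log10(p_ref + eps) - np.log10(p_est + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))
```

Lower values mean the estimated spectrum is closer to the reference; identical signals give a distance of zero.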

 

Official implementation of FLowHigh: https://github.com/jjunak-yun/FLowHigh_code

FLowHigh was trained and evaluated on the VCTK dataset.

 

We compared FLowHigh with the following audio super-resolution models:

1. NU-Wave2 [Code][Demo page]

2. UDM+ [Code][Demo page]

3. mdctGAN [Code]

4. Fre-Painter [Code][Demo page]

Audible Frequency

Before listening, we recommend a brief frequency test to check which frequencies you can hear on your playback setup.

WARNING: LOUD HIGH-FREQUENCY SOUNDS MAY RESULT IN UNCOMFORTABLE OR PAINFUL SENSATIONS.

(For a smoother testing experience, please select the frequency by sliding the slider bar rather than clicking.)

Slider range: 2000 Hz to 22000 Hz

FLowHigh

For a fair comparison, we used the official implementations and provided checkpoints of the comparison models. The demo samples were randomly selected from the test set. The diffusion-based models, NU-Wave2 and UDM+, generated audio samples using 50 sampling steps to achieve adequate sound quality. You can listen to audio samples with a sampling rate of 48 kHz, generated from input sampling rates ranging from 8 kHz to 24 kHz. To make efficiency easy to compare, we also indicate the number of function evaluations (NFE) used by our model.
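The "Euler, NFE=1" and "Midpoint, NFE=2" labels in the sample lists refer to the ODE solver used for sampling and the number of times the vector field is evaluated. The sketch below shows how that count arises for one step of each solver. It is an illustration under assumptions, not the FLowHigh implementation: `v` is a hypothetical callable standing in for the trained vector-field network, and the toy field is a constant velocity chosen only so the result is easy to check.

```python
import numpy as np

def euler_solve(v, x0, steps=1):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with explicit Euler steps."""
    x = np.asarray(x0, dtype=float)
    h = 1.0 / steps
    nfe = 0
    for i in range(steps):
        x = x + h * v(i * h, x)  # one evaluation per step
        nfe += 1
    return x, nfe

def midpoint_solve(v, x0, steps=1):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with the midpoint method."""
    x = np.asarray(x0, dtype=float)
    h = 1.0 / steps
    nfe = 0
    for i in range(steps):
        t = i * h
        k1 = v(t, x)                              # slope at the start of the step
        k2 = v(t + h / 2.0, x + (h / 2.0) * k1)   # slope at the midpoint
        x = x + h * k2
        nfe += 2                                  # two evaluations per step
    return x, nfe

# Toy stand-in for a trained network: a constant velocity that moves x0 to x1.
x0 = np.zeros(4)
x1 = np.array([0.5, -1.0, 2.0, 0.0])
toy_field = lambda t, x: x1 - x0

x_euler, nfe_euler = euler_solve(toy_field, x0, steps=1)
x_mid, nfe_mid = midpoint_solve(toy_field, x0, steps=1)
```

Under this accounting, NFE equals the number of steps times the evaluations per step, so a single Euler step costs one network call and a single midpoint step costs two.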

Audio: p364_055

Sampling rate

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2

UDM+

mdctGAN

Fre-Painter

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Audio: p361_281

Sampling rate

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2

UDM+

mdctGAN

Fre-Painter

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Audio: p360_253

Sampling rate

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2

UDM+

mdctGAN

Fre-Painter

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Audio: p376_159

Sampling rate

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2

UDM+

mdctGAN

Fre-Painter

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Sampling Steps

Audio: p362_115 (8 kHz to 48 kHz)

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2 (NFE=1)

NU-Wave2 (NFE=10)

NU-Wave2 (NFE=25)

NU-Wave2 (NFE=50)

NU-Wave2 (NFE=100)

UDM+ (NFE=1)

UDM+ (NFE=10)

UDM+ (NFE=25)

UDM+ (NFE=50)

UDM+ (NFE=100)

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Audio: p363_188 (12 kHz to 48 kHz)

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2 (NFE=1)

NU-Wave2 (NFE=10)

NU-Wave2 (NFE=25)

NU-Wave2 (NFE=50)

NU-Wave2 (NFE=100)

UDM+ (NFE=1)

UDM+ (NFE=10)

UDM+ (NFE=25)

UDM+ (NFE=50)

UDM+ (NFE=100)

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Audio: p374_336 (16 kHz to 48 kHz)

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2 (NFE=1)

NU-Wave2 (NFE=10)

NU-Wave2 (NFE=25)

NU-Wave2 (NFE=50)

NU-Wave2 (NFE=100)

UDM+ (NFE=1)

UDM+ (NFE=10)

UDM+ (NFE=25)

UDM+ (NFE=50)

UDM+ (NFE=100)

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Audio: p360_072 (24 kHz to 48 kHz)

GT

Input

GT Reconstruction (w/ Post-processing)

NU-Wave2 (NFE=1)

NU-Wave2 (NFE=10)

NU-Wave2 (NFE=25)

NU-Wave2 (NFE=50)

NU-Wave2 (NFE=100)

UDM+ (NFE=1)

UDM+ (NFE=10)

UDM+ (NFE=25)

UDM+ (NFE=50)

UDM+ (NFE=100)

FLowHigh (Euler, NFE=1)

FLowHigh (Midpoint, NFE=2)

Path Analysis

Audio: p360_003 (8 kHz to 48 kHz)

GT

Input

GT Reconstruction (w/ Post-processing)

FLowHigh
(\(\mu_t(z)=tx_1, \quad \sigma_t=1-(1-\sigma)t\))

FLowHigh
(\(\mu_t(z)=tx_1+(1-t)x_0, \quad \sigma_t=\sigma\))

FLowHigh
(\(\mu_t(z)=tx_1+(1-t)x_0, \quad \sigma_t=1-(1-\sigma)t\))

Audio: p364_165 (12 kHz to 48 kHz)

GT

Input

GT Reconstruction (w/ Post-processing)

FLowHigh
(\(\mu_t(z)=tx_1, \quad \sigma_t=1-(1-\sigma)t\))

FLowHigh
(\(\mu_t(z)=tx_1+(1-t)x_0, \quad \sigma_t=\sigma\))

FLowHigh
(\(\mu_t(z)=tx_1+(1-t)x_0, \quad \sigma_t=1-(1-\sigma)t\))

Audio: p362_207 (16 kHz to 48 kHz)

GT

Input

GT Reconstruction (w/ Post-processing)

FLowHigh
(\(\mu_t(z)=tx_1, \quad \sigma_t=1-(1-\sigma)t\))

FLowHigh
(\(\mu_t(z)=tx_1+(1-t)x_0, \quad \sigma_t=\sigma\))

FLowHigh
(\(\mu_t(z)=tx_1+(1-t)x_0, \quad \sigma_t=1-(1-\sigma)t\))
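The three probability paths compared above can be sketched in code. The sketch assumes the standard Gaussian conditional path of flow matching, \(x_t = \mu_t(z) + \sigma_t \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\), whose regression target is \(u_t = \mu_t'(z) + \sigma_t' \epsilon\). It also assumes that \(x_1\) is the high-resolution target and \(x_0\) the low-resolution input, which this page does not state explicitly. The function name, the variant labels, and any value passed for `sigma` are hypothetical.

```python
import numpy as np

def sample_path(x0, x1, t, sigma, variant, rng):
    """Return (x_t, u_t) for one of the three (mu_t, sigma_t) choices above."""
    eps = rng.standard_normal(np.shape(x1))
    if variant == "noise_to_target":
        # mu_t = t*x1, sigma_t = 1 - (1 - sigma)*t
        mu, d_mu = t * x1, x1
        sig, d_sig = 1.0 - (1.0 - sigma) * t, -(1.0 - sigma)
    elif variant == "interp_const_sigma":
        # mu_t = t*x1 + (1 - t)*x0, sigma_t = sigma
        mu, d_mu = t * x1 + (1.0 - t) * x0, x1 - x0
        sig, d_sig = sigma, 0.0
    elif variant == "interp_decay_sigma":
        # mu_t = t*x1 + (1 - t)*x0, sigma_t = 1 - (1 - sigma)*t
        mu, d_mu = t * x1 + (1.0 - t) * x0, x1 - x0
        sig, d_sig = 1.0 - (1.0 - sigma) * t, -(1.0 - sigma)
    else:
        raise ValueError(f"unknown variant: {variant}")
    x_t = mu + sig * eps      # sample on the path at time t
    u_t = d_mu + d_sig * eps  # time derivative of x_t for fixed eps
    return x_t, u_t
```

In the two interpolating variants the mean starts at \(x_0\) rather than at zero, so at \(t=0\) the path is already centered on the input signal; the variants differ only in how much noise surrounds that mean over time.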