Jun-Hak Yun, Seung-Bin Kim, and Seong-Whan Lee
Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which significantly increases latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL, while maintaining computational efficiency with only a single-step sampling process.
Official implementation of FLowHigh: https://github.com/jjunak-yun/FLowHigh_code
FLowHigh was trained and evaluated on the VCTK dataset.
We compared FLowHigh with the following audio super-resolution models:
3. mdctGAN [Code]
We recommend conducting a brief frequency test to ensure precise listening.
WARNING: LOUD HIGH-FREQUENCY SOUNDS MAY RESULT IN UNCOMFORTABLE OR PAINFUL SENSATIONS. (For a smoother testing experience, please select the frequency by sliding the slider bar rather than clicking.) The test frequency ranges from 2000 Hz to 22000 Hz.
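For readers who want to run a comparable frequency check offline, the sketch below writes a short, quiet sine tone to a WAV file using only the Python standard library. The function name `write_test_tone`, the low default amplitude, and the 48 kHz default sample rate are assumptions chosen for illustration; they are not part of the FLowHigh demo.

```python
import math
import struct
import wave


def write_test_tone(path, freq_hz, duration_s=1.0, sample_rate=48000, amplitude=0.05):
    """Write a mono 16-bit PCM sine tone to `path` and return the sample count.

    The amplitude is kept low on purpose: high-frequency tones can be
    uncomfortable at high volume.
    """
    if not 0 < freq_hz < sample_rate / 2:
        raise ValueError("frequency must be positive and below the Nyquist limit")
    if not 0 < amplitude <= 1:
        raise ValueError("amplitude must be in (0, 1]")

    n_samples = round(duration_s * sample_rate)
    frames = bytearray()
    for n in range(n_samples):
        sample = amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian int16

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_samples
```

Calling `write_test_tone("tone_2000.wav", 2000)` and then repeating with higher frequencies up to 22000 Hz gives a rough picture of which frequencies your playback chain and hearing can reproduce. Start at a low system volume.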
For a fair comparison, we used the official implementations and provided checkpoints of the comparison models. The demo samples were randomly selected from the test set. The diffusion-based models, Nu-Wave2 and UDM+, generated audio samples using 50 sampling steps to achieve adequate sound quality. You can listen to audio samples with a sampling rate of 48 kHz, generated from input sampling rates ranging from 8 kHz to 24 kHz. To make efficiency easy to verify, we also indicate the number of function evaluations (NFE) of our model.
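To make the NFE labels concrete: a single Euler step evaluates the vector field once, while a single midpoint step evaluates it twice, which matches the "Euler, NFE=1" and "Midpoint, NFE=2" labels used for FLowHigh in the tables. The sketch below counts evaluations on a toy scalar vector field that stands in for the learned network. `CountingField`, `toy_field`, and the solver function names are hypothetical and not taken from the official implementation.

```python
class CountingField:
    """Wrap a vector field v(t, x) and count how many times it is evaluated."""

    def __init__(self, field):
        self.field = field
        self.nfe = 0

    def __call__(self, t, x):
        self.nfe += 1
        return self.field(t, x)


def euler_solve(v, x0, steps=1):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with the Euler method."""
    x, h = x0, 1.0 / steps
    for i in range(steps):
        x = x + h * v(i * h, x)  # one evaluation per step
    return x


def midpoint_solve(v, x0, steps=1):
    """Integrate dx/dt = v(t, x) from t=0 to t=1 with the midpoint method."""
    x, h = x0, 1.0 / steps
    for i in range(steps):
        t = i * h
        x_mid = x + 0.5 * h * v(t, x)        # first evaluation
        x = x + h * v(t + 0.5 * h, x_mid)    # second evaluation
    return x


TARGET = 3.0


def toy_field(t, x):
    """Toy stand-in for a learned field: points straight at TARGET."""
    return (TARGET - x) / (1.0 - t)


euler_field = CountingField(toy_field)
euler_result = euler_solve(euler_field, x0=1.0, steps=1)

midpoint_field = CountingField(toy_field)
midpoint_result = midpoint_solve(midpoint_field, x0=1.0, steps=1)
```

On this straight-line toy field both solvers reach the target in one step, with `euler_field.nfe == 1` and `midpoint_field.nfe == 2`. A real learned field is not exactly straight, so the two solvers generally give different outputs at different cost.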
Audio: p364_055 | ||
---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) |
NU-Wave2 | UDM+ | mdctGAN |
Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) |
Audio: p361_281 | ||
---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) |
NU-Wave2 | UDM+ | mdctGAN |
Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) |
Audio: p360_253 | ||
---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) |
NU-Wave2 | UDM+ | mdctGAN |
Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) |
Audio: p376_159 | ||||
---|---|---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) | ||
NU-Wave2 | UDM+ | mdctGAN | ||
Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | ||
Audio: p362_115 (8 kHz to 48 kHz) | ||||
---|---|---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) | ||
Nu-Wave2 (NFE=1) | Nu-Wave2 (NFE=10) | Nu-Wave2 (NFE=25) | Nu-Wave2 (NFE=50) | Nu-Wave2 (NFE=100) |
UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | |||
Audio: p363_188 (12 kHz to 48 kHz) | ||||
---|---|---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) | ||
Nu-Wave2 (NFE=1) | Nu-Wave2 (NFE=10) | Nu-Wave2 (NFE=25) | Nu-Wave2 (NFE=50) | Nu-Wave2 (NFE=100) |
UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | |||
Audio: p374_336 (16 kHz to 48 kHz) | ||||
---|---|---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) | ||
Nu-Wave2 (NFE=1) | Nu-Wave2 (NFE=10) | Nu-Wave2 (NFE=25) | Nu-Wave2 (NFE=50) | Nu-Wave2 (NFE=100) |
UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | |||
Audio: p360_072 (24 kHz to 48 kHz) | ||||
---|---|---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) | ||
Nu-Wave2 (NFE=1) | Nu-Wave2 (NFE=10) | Nu-Wave2 (NFE=25) | Nu-Wave2 (NFE=50) | Nu-Wave2 (NFE=100) |
UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | |||
Audio: p360_003 (8 kHz to 48 kHz) | ||
---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) |
FLowHigh | FLowHigh | FLowHigh |
Audio: p364_165 (12 kHz to 48 kHz) | ||
---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) |
FLowHigh | FLowHigh | FLowHigh |
Audio: p362_207 (16 kHz to 48 kHz) | ||
---|---|---|
GT | Input | GT Reconstruction (w/ Post-processing) |
FLowHigh | FLowHigh | FLowHigh |