Jun-Hak Yun, Seung-Bin Kim, and Seong-Whan Lee
Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the need for numerous sampling steps, which significantly increases latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL, while maintaining computational efficiency with only a single-step sampling process.
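For readers unfamiliar with the log-spectral distance (LSD) mentioned above, the following is a minimal sketch of one common formulation: the per-frame root-mean-square difference of log power spectra, averaged over frames. The STFT settings (`n_fft=2048`, `hop=512`, Hann window), the `eps` floor, and the function names are assumptions made for illustration and are not taken from the FLowHigh evaluation code.

```python
import numpy as np

def power_spectrogram(x, n_fft=2048, hop=512):
    """Hann-windowed power spectrogram, shape (frames, n_fft // 2 + 1).

    The window and hop sizes are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_spectral_distance(ref, est, eps=1e-8):
    """Mean over frames of the RMS (over frequency) log10 power difference."""
    p_ref = power_spectrogram(ref)
    p_est = power_spectrogram(est)
    log_diff = np.log10(p_ref + eps) - np.log10(p_est + eps)
    return float(np.mean(np.sqrt(np.mean(log_diff ** 2, axis=1))))

# Toy check on one second of noise at 48 kHz (synthetic data, not VCTK).
rng = np.random.default_rng(0)
reference = rng.standard_normal(48000)
lsd_same = log_spectral_distance(reference, reference)
lsd_scaled = log_spectral_distance(reference, 10.0 * reference)
print(lsd_same, lsd_scaled)
```

Lower values indicate that the estimated spectrum is closer to the reference. Scaling the signal by 10 raises every power bin by a factor of 100, so the second value should be close to 2.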
Official implementation of FLowHigh: https://github.com/jjunak-yun/FLowHigh_code
FLowHigh was trained and evaluated on the VCTK dataset.
We compared FLowHigh with the following audio super-resolution models:
3. mdctGAN [Code]
We recommend conducting a brief frequency test to ensure precise listening.
WARNING: LOUD HIGH-FREQUENCY SOUNDS MAY RESULT IN UNCOMFORTABLE OR PAINFUL SENSATIONS. (For a smoother testing experience, please select the frequency by sliding the slider bar rather than clicking.)
Frequency range: 2000 Hz to 22000 Hz
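If the interactive slider is unavailable, a comparable test tone can be generated offline. The sketch below builds a quiet mono sine tone as 16-bit WAV bytes using NumPy and the standard-library `wave` module. The function name, the 48 kHz rate, the one-second duration, and the low default amplitude are assumptions chosen for illustration; keep the playback volume low, in line with the warning above.

```python
import io
import wave
import numpy as np

def sine_tone_wav(freq_hz, sr=48000, seconds=1.0, amplitude=0.1):
    """Return WAV bytes for a mono 16-bit sine tone (hypothetical helper)."""
    if not 0 < freq_hz < sr / 2:
        raise ValueError("frequency must lie below the Nyquist limit sr / 2")
    t = np.arange(int(sr * seconds)) / sr
    samples = amplitude * np.sin(2 * np.pi * freq_hz * t)
    pcm = (samples * 32767).astype("<i2")  # little-endian 16-bit PCM
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sr)
        wav_file.writeframes(pcm.tobytes())
    return buf.getvalue()

# Tones at the two ends of the slider range shown on the page.
low_tone = sine_tone_wav(2000)
high_tone = sine_tone_wav(22000)
```

Writing `low_tone` or `high_tone` to a `.wav` file gives a tone that any audio player can open. If the 22000 Hz tone is inaudible, the playback chain or the listener's hearing may not cover the full band that the 48 kHz samples contain.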
For a fair comparison, we used the official implementations and provided checkpoints of the comparison models. The demo samples were randomly drawn from the test set. The diffusion-based models, NU-Wave2 and UDM+, generated audio samples using 50 sampling steps to achieve adequate sound quality. You can listen to audio samples at a sampling rate of 48 kHz, generated from input sampling rates ranging from 8 kHz to 24 kHz. To make efficiency easy to verify, we also indicate the number of function evaluations (NFE) used by our model.
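To make the NFE labels concrete, the sketch below integrates a toy ODE from t = 0 to t = 1 with a single Euler step (one vector-field evaluation) and a single midpoint step (two evaluations). The vector field here is a hand-written stand-in that points straight at a fixed target; it is an assumption for illustration and is not the trained FLowHigh network, and the helper names are hypothetical.

```python
import numpy as np

class CountingField:
    """Wraps a vector field v(x, t) and counts how often it is evaluated."""

    def __init__(self, field):
        self.field = field
        self.nfe = 0

    def __call__(self, x, t):
        self.nfe += 1
        return self.field(x, t)

def euler_solve(v, x0, steps=1):
    """Explicit Euler from t=0 to t=1: one field evaluation per step."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

def midpoint_solve(v, x0, steps=1):
    """Explicit midpoint from t=0 to t=1: two field evaluations per step."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x_half = x + 0.5 * dt * v(x, t)
        x = x + dt * v(x_half, t + 0.5 * dt)
    return x

# Toy straight-line field toward a fixed target (stand-in for a learned model).
target = np.array([1.0, -2.0, 0.5])

def toy_field(x, t):
    return (target - x) / (1.0 - t)

x0 = np.zeros(3)
euler_field = CountingField(toy_field)
x_euler = euler_solve(euler_field, x0)
midpoint_field = CountingField(toy_field)
x_midpoint = midpoint_solve(midpoint_field, x0)
print(euler_field.nfe, midpoint_field.nfe)
```

Because the toy trajectories are straight lines, both solvers land on the target exactly. With a learned vector field the two results generally differ, which is why the page lists the Euler and midpoint samples separately.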
| Audio: p364_055 | | |
|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) |
| NU-Wave2 | UDM+ | mdctGAN |
| Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) |

| Audio: p361_281 | | |
|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) |
| NU-Wave2 | UDM+ | mdctGAN |
| Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) |

| Audio: p360_253 | | |
|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) |
| NU-Wave2 | UDM+ | mdctGAN |
| Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) |

| Audio: p376_159 | | |
|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) |
| NU-Wave2 | UDM+ | mdctGAN |
| Fre-Painter | FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) |

| Audio: p362_115 (8 kHz to 48 kHz) | | | | |
|---|---|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) | | |
| NU-Wave2 (NFE=1) | NU-Wave2 (NFE=10) | NU-Wave2 (NFE=25) | NU-Wave2 (NFE=50) | NU-Wave2 (NFE=100) |
| UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
| FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | | | |

| Audio: p363_188 (12 kHz to 48 kHz) | | | | |
|---|---|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) | | |
| NU-Wave2 (NFE=1) | NU-Wave2 (NFE=10) | NU-Wave2 (NFE=25) | NU-Wave2 (NFE=50) | NU-Wave2 (NFE=100) |
| UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
| FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | | | |

| Audio: p374_336 (16 kHz to 48 kHz) | | | | |
|---|---|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) | | |
| NU-Wave2 (NFE=1) | NU-Wave2 (NFE=10) | NU-Wave2 (NFE=25) | NU-Wave2 (NFE=50) | NU-Wave2 (NFE=100) |
| UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
| FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | | | |

| Audio: p360_072 (24 kHz to 48 kHz) | | | | |
|---|---|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) | | |
| NU-Wave2 (NFE=1) | NU-Wave2 (NFE=10) | NU-Wave2 (NFE=25) | NU-Wave2 (NFE=50) | NU-Wave2 (NFE=100) |
| UDM+ (NFE=1) | UDM+ (NFE=10) | UDM+ (NFE=25) | UDM+ (NFE=50) | UDM+ (NFE=100) |
| FLowHigh (Euler, NFE=1) | FLowHigh (Midpoint, NFE=2) | | | |

| Audio: p360_003 (8 kHz to 48 kHz) | | |
|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) |
| FLowHigh | FLowHigh | FLowHigh |

| Audio: p364_165 (12 kHz to 48 kHz) | | |
|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) |
| FLowHigh | FLowHigh | FLowHigh |

| Audio: p362_207 (16 kHz to 48 kHz) | | |
|---|---|---|
| GT | Input | GT Reconstruction (w/ Post-processing) |
| FLowHigh | FLowHigh | FLowHigh |