FA-GAN: Few-artifacts High-fidelity GAN-based Vocoder
1. Abstract
Generative adversarial network (GAN) based vocoders have attracted significant attention in speech synthesis for their high quality and fast inference speed. However, generated samples still show many noticeable artifacts relative to the ground truth, which degrades the quality of synthesized speech. In this work, we propose FA-GAN, a novel GAN-based vocoder designed for few artifacts and high fidelity. 1) To suppress the aliasing artifacts caused by non-ideal upsampling layers in high-frequency areas, we introduce the twin deconvolution module in the generator. 2) To alleviate the blurring artifacts and enrich the reconstruction of spectral details, we propose a novel fine-grained multi-resolution real and imaginary loss by inheriting the real and imaginary components of complex spectrograms in the discriminators. The experimental results show that FA-GAN outperforms state-of-the-art approaches in improving audio quality and alleviating spectral artifacts, and that it also performs better in unseen-speaker scenarios.
2. Seen speaker (LJSpeech)
We train FA-GAN on the LJSpeech dataset, which we randomly divide into training, validation, and test sets of 80%, 10%, and 10% respectively. Here are demos of the baselines and our proposed FA-GAN in the seen-speaker scenario.
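The random 80/10/10 split described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `split_dataset`, the fixed seed, and the placeholder utterance IDs are assumptions; only the split proportions come from the text. The count of 13,100 clips matches the public LJSpeech release.

```python
import random


def split_dataset(items, train_pct=80, val_pct=10, seed=0):
    """Shuffle items and split them into train/validation/test lists.

    Percentages are integers so the split sizes are computed exactly;
    whatever remains after train and validation goes to the test set.
    """
    items = list(items)
    rng = random.Random(seed)  # fixed seed (assumed) keeps the split reproducible
    rng.shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Placeholder IDs standing in for the 13,100 LJSpeech clips.
utterance_ids = [f"utt{i:05d}" for i in range(13100)]
train_ids, val_ids, test_ids = split_dataset(utterance_ids)
```

Seeding the shuffle means the same utterances land in each subset on every run, which keeps comparisons against the baselines on an identical test set.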
Demos:
Utterance | Ground Truth | HiFi-GAN | UnivNet-c32 | Avocodo | FA-GAN |
---|---|---|---|---|---|
LJ001-0028 | | | | | |
LJ005-0156 | | | | | |
LJ008-0172 | | | | | |
LJ016-0156 | | | | | |
LJ035-0086 | | | | | |
3. Unseen speakers (VCTK)
We test the unseen-speaker scenario on the VCTK Corpus, with all audio samples downsampled to 22050 Hz. The audio demos are as follows.
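As a small worked example of the downsampling step, the sketch below computes the integer up/down factors a polyphase resampler would need. It assumes a 48 kHz source rate, the rate at which VCTK is distributed; the page does not say which resampler was used, and the helper names `resample_factors` and `resampled_length` are hypothetical.

```python
from fractions import Fraction

SOURCE_RATE = 48000  # assumed VCTK source rate in Hz (not stated on this page)
TARGET_RATE = 22050  # target rate in Hz, as stated in the text


def resample_factors(source_rate, target_rate):
    """Return (up, down) so that target_rate == source_rate * up / down."""
    ratio = Fraction(target_rate, source_rate)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


def resampled_length(num_samples, up, down):
    """Number of output samples, rounded up (integer ceiling division)."""
    return -(-num_samples * up // down)


up, down = resample_factors(SOURCE_RATE, TARGET_RATE)
one_second = resampled_length(SOURCE_RATE, up, down)
```

Under this assumption the ratio reduces to 147/320, so one second of 48 kHz audio maps to 22050 samples. These factors could be passed to a band-limited polyphase resampler; a naive decimation without low-pass filtering would itself introduce aliasing.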
Demos:
Speaker | Ground Truth | HiFi-GAN | UnivNet-c32 | Avocodo | FA-GAN |
---|---|---|---|---|---|
p258 | | | | | |
p264 | | | | | |
p265 | | | | | |
p284 | | | | | |
p340 | | | | | |