Wave-Trainer-Fit: Neural Vocoder with Trainable Prior and Fixed-Point Iteration towards High-Quality Speech Generation from SSL features

Under review


Authors: Hien Ohnaka, Yuma Shirahata, Masaya Kawamura

[arxiv][code (will be available)]

Table of contents

  • Background and Motivation
  • Proposed method
  • Overall results
  • Impact of intermediate outputs
  • SSL layer-wise analysis
  • References

Background and Motivation

With the development of recent self-supervised learning (SSL) models, generation tasks from SSL features have also achieved success [1, 2]. Neural vocoders from SSL features are crucial components that determine the topline of these tasks. WaveFit [3] is a vocoder that has already been applied successfully to speech generation from SSL features [1, 2]. It is a fixed-point iteration vocoder that combines GAN-based training with diffusion-model-like iterative inference.
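As background for the iteration counts discussed later, the following is a minimal sketch of a WaveFit-style inference loop, assuming the update $z_{t-1}=\mathcal{G}(z_t-\mathcal{F}(z_t,c),c)$ described in [3]; the function names and the gain adjustment of the initial noise ("iteration 0") are illustrative placeholders, not the exact implementation.

```python
def wavefit_inference(z_T, cond, denoiser, gain_adjust, T=5):
    """Sketch of WaveFit-style fixed-point iteration (see [3] for the exact scheme).

    z_T:         initial noise (batch, samples)
    cond:        conditioning features (mel-spectrogram or SSL features)
    denoiser:    network F(z_t, cond) estimating the residual to subtract
    gain_adjust: gain-adjustment operator G(., cond)
    """
    z = gain_adjust(z_T, cond)                        # iteration 0: normalized initial noise
    intermediates = [z]
    for _ in range(T):
        z = gain_adjust(z - denoiser(z, cond), cond)  # fixed-point update z_{t-1}
        intermediates.append(z)
    return z, intermediates                           # final waveform and intermediate outputs
```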

However, compared to waveform generation from mel-spectrograms, WaveFit from SSL features has two limitations:

  1. Initial noise sampling
    • WaveFit from mel-spectrograms: Well-designed noise sampling [4] is available. This approach is expected to provide the model with a reasonable prior for waveform generation, but it requires spectral envelope information.
    • WaveFit from SSL features: Because the spectral envelope cannot be accessed from SSL features, sampling from a standard normal distribution $\mathcal{N}(0,I)$ was used instead. Compared to the approach above, this may compromise performance.
  2. Gain adjustment
    • WaveFit from mel-spectrograms: The following gain adjustment is applied to the predicted waveform ${z}_t$, using the power $P_z$ of the output and the ground-truth power $P_c$ computed from the mel-spectrogram ${c}$: $\mathcal{G}({z}_t,{c})=\sqrt{P_c/(P_z+s)}\,{z}_t.$ As a result, the vocoder is freed from the implicit energy estimation task and can focus on essential waveform modeling.
    • WaveFit from SSL features: Because the ground-truth power cannot be obtained from SSL features, the following reference-free gain adjustment was applied instead: $\hat{\mathcal{G}}({z}_t)=0.9 \cdot {z}_t/\max(\mathrm{abs}({z}_t)).$ This adjustment forfeits the advantages described above (a short sketch of both gain-adjustment rules follows this list).
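For concreteness, here is a minimal NumPy sketch of the two gain-adjustment rules above. The utterance-level power computation and the stabilizing constant s are assumptions for illustration, and the function names are not from the original papers.

```python
import numpy as np

def gain_adjust_reference(z_t: np.ndarray, P_c: float, s: float = 1e-8) -> np.ndarray:
    """Reference-aware gain adjustment G(z_t, c): rescale the estimate so that its
    power matches the ground-truth power P_c computed from the mel-spectrogram.
    s is a small constant that avoids division by zero."""
    P_z = np.mean(z_t ** 2)                  # power of the current estimate
    return np.sqrt(P_c / (P_z + s)) * z_t

def gain_adjust_reference_free(z_t: np.ndarray) -> np.ndarray:
    """Reference-free gain adjustment used when the target power is unknown:
    peak-normalize the estimate to an amplitude of 0.9."""
    return 0.9 * z_t / np.max(np.abs(z_t))
```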

Our goal is to improve the performance of WaveFit from SSL features by bridging these gaps with respect to WaveFit from mel-spectrograms. To achieve this, we introduce trainable priors inspired by RestoreGrad [5].

Proposed method

Figure 1: Overview of the proposed model. During training, the posterior VAE, derived from the target waveform and the SSL features, is used for noise sampling and gain adjustment. During inference, the prior VAE, derived from the SSL features alone, is used for the same process. Solid arrows are enabled during both training and inference.

We propose a neural vocoder with a trainable prior and fixed-point iteration (WaveTrainerFit) for improved waveform generation from SSL features. First, by introducing variational autoencoder (VAE)-based trainable priors, we achieve noise sampling $\mathcal{S}(\Sigma)$ close to the target waveform. Since inference can start from a point close to the target speech, we expect high-quality waveform generation with fewer iterations while robustly maintaining speaker characteristics. Furthermore, by constraining the priors to match the energy of the speech, we realize a reference-aware gain adjustment $\mathcal{G}_\mathrm{ssl}(z_t, \Sigma)$, which frees the vocoder from the implicit energy estimation task. As a result, the model can focus on the essential aspects of waveform modeling, which is expected to ease training.
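To make the two components concrete, below is a minimal PyTorch sketch of how a trainable prior could supply both the noise sampling $\mathcal{S}(\Sigma)$ and the gain adjustment $\mathcal{G}_\mathrm{ssl}(z_t, \Sigma)$. The module and function names, the diagonal parameterization of $\Sigma$, and the use of the prior variance as the target power are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class PriorEncoder(nn.Module):
    """Illustrative prior encoder: maps SSL features to the log-variance of a
    per-sample diagonal covariance Sigma over the waveform samples."""

    def __init__(self, ssl_dim: int, upsample: int):
        super().__init__()
        self.proj = nn.Conv1d(ssl_dim, 1, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=upsample, mode="nearest")

    def forward(self, ssl_feat: torch.Tensor) -> torch.Tensor:
        # ssl_feat: (batch, ssl_dim, frames) -> log-variance: (batch, samples)
        return self.upsample(self.proj(ssl_feat)).squeeze(1)

def sample_initial_noise(log_var: torch.Tensor) -> torch.Tensor:
    """S(Sigma): draw the initial estimate from N(0, Sigma) instead of N(0, I)."""
    return torch.randn_like(log_var) * torch.exp(0.5 * log_var)

def gain_adjust_ssl(z_t: torch.Tensor, log_var: torch.Tensor, s: float = 1e-8) -> torch.Tensor:
    """G_ssl(z_t, Sigma): reference-aware gain adjustment where the target power is
    taken from the prior covariance (constrained during training to match the speech
    energy), freeing the vocoder from implicit energy estimation."""
    P_sigma = torch.exp(log_var).mean(dim=-1, keepdim=True)  # power implied by Sigma
    P_z = (z_t ** 2).mean(dim=-1, keepdim=True)              # power of the current estimate
    return torch.sqrt(P_sigma / (P_z + s)) * z_t

if __name__ == "__main__":
    ssl_feat = torch.randn(2, 1024, 50)             # dummy SSL features: (batch, dim, frames)
    enc = PriorEncoder(ssl_dim=1024, upsample=320)  # 320 samples per SSL frame (assumed)
    log_var = enc(ssl_feat)
    z = sample_initial_noise(log_var)               # start from a point close to the target
    z = gain_adjust_ssl(z, log_var)                 # reference-aware gain adjustment
    print(z.shape)                                  # torch.Size([2, 16000])
```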

Overall results

We used the LibriTTS-R corpus [6] and evaluated on the utterances in “test-clean”. Three feature extractors were used to obtain conditioning features: WavLM [7], XLS-R [8], and the Whisper-medium encoder [9].

Table 1: Evaluation results on LibriTTS-R test-clean with 8th-layer SSL features and $T=5$. Bold indicates the best method under the same conditions, and underlines indicate significant differences between WaveFit and WaveTrainerFit.


Speech samples


Compared to baselines
260_123288_000023_000005
(middle-pitch, male)
Transcription.
"My eyes fail under the dazzling light, my ears are stunned with the incessant crash of thunder."
clean
HiFi-GAN (WavLM)
WaveFit (WavLM)
Proposed WaveTrainerFit (WavLM)
HiFi-GAN (XLS-R)
WaveFit (XLS-R)
Proposed WaveTrainerFit (XLS-R)
HiFi-GAN (Whisper)
WaveFit (Whisper)
Proposed WaveTrainerFit (Whisper)


8555_284449_000041_000001
(high-pitch, female)
Transcription.
"I'll have 'Sizzle make a fine yard for the goat, where he'll have plenty of blue grass to eat."
clean
HiFi-GAN (WavLM)
WaveFit (WavLM)
Proposed WaveTrainerFit (WavLM)
HiFi-GAN (XLS-R)
WaveFit (XLS-R)
Proposed WaveTrainerFit (XLS-R)
HiFi-GAN (Whisper)
WaveFit (Whisper)
Proposed WaveTrainerFit (Whisper)


1188_133604_000024_000000
(low-pitch, male)
Transcription.
"But in this vignette, copied from Turner, you have the two principles brought out perfectly."
clean
HiFi-GAN (WavLM)
WaveFit (WavLM)
Proposed WaveTrainerFit (WavLM)
HiFi-GAN (XLS-R)
WaveFit (XLS-R)
Proposed WaveTrainerFit (XLS-R)
HiFi-GAN (Whisper)
WaveFit (Whisper)
Proposed WaveTrainerFit (Whisper)


1995_1836_000030_000000
(middle-pitch, female)
Transcription.
"But you mean to say you can't even advise her?"
clean
HiFi-GAN (WavLM)
WaveFit (WavLM)
Proposed WaveTrainerFit (WavLM)
HiFi-GAN (XLS-R)
WaveFit (XLS-R)
Proposed WaveTrainerFit (XLS-R)
HiFi-GAN (Whisper)
WaveFit (Whisper)
Proposed WaveTrainerFit (Whisper)


Impact of intermediate outputs

Objective evaluation results

Although introducing the VAE slightly increases the real-time factor (RTF), the proposed method achieves superior scores on all metrics other than RTF at every iteration count.

Speech samples

4507_16021_000032_000001
"Is it really the French tongue, the great human tongue?"
clean
WaveFit (Whisper, l8) Iteration 0 (Normalized initial noise)
Proposed WaveTrainerFit (Whisper, l8) Iteration 0
WaveFit Iteration 1
Proposed WaveTrainerFit Iteration 1
WaveFit Iteration 2
Proposed WaveTrainerFit Iteration 2
WaveFit Iteration 3
Proposed WaveTrainerFit Iteration 3
WaveFit Iteration 4
Proposed WaveTrainerFit Iteration 4
WaveFit Iteration 5
Proposed WaveTrainerFit Iteration 5


5105_28241_000027_000002: "Nothing was to be done but to put about, and return in disappointment towards the north."
clean
WaveFit (Whisper, l8) Iteration 0 (Normalized initial noise)
Proposed WaveTrainerFit (Whisper, l8) Iteration 0
WaveFit Iteration 1
Proposed WaveTrainerFit Iteration 1
WaveFit Iteration 2
Proposed WaveTrainerFit Iteration 2
WaveFit Iteration 3
Proposed WaveTrainerFit Iteration 3
WaveFit Iteration 4
Proposed WaveTrainerFit Iteration 4
WaveFit Iteration 5
Proposed WaveTrainerFit Iteration 5

SSL layer-wise analysis

In general, features from shallow layers are known to retain much of the acoustic information of the input samples, whereas features from deeper layers capture more semantic information related to training targets such as pseudo-labels. To verify that the proposed method works robustly for features with such different properties, we also evaluated it on WavLM layers 2 and 24.
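For reference, intermediate-layer features like those compared here can be extracted with, for example, the HuggingFace transformers implementation of WavLM; the microsoft/wavlm-large checkpoint and the dummy 16 kHz input below are assumptions for illustration, not necessarily the exact setup used in our experiments.

```python
import torch
from transformers import AutoFeatureExtractor, WavLMModel

# Illustrative checkpoint; the exact SSL model/configuration may differ.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

waveform = torch.randn(16000)  # dummy 1-second 16 kHz waveform in place of a real utterance
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the pre-Transformer (convolutional) feature sequence;
# hidden_states[k] is the output of Transformer layer k, so layers 2, 8, and 24
# can be indexed directly.
layer2, layer8, layer24 = (outputs.hidden_states[k] for k in (2, 8, 24))
print(layer8.shape)  # (batch, frames, 1024) for WavLM-Large
```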

Objective evaluation results

Speech samples

121_127105_000008_000000: "I quite agree--in regard to Griffin's ghost, or whatever it was--that its appearing first to the little boy, at so tender an age, adds a particular touch."
clean
WaveFit (WavLM, l2)
WaveFit (WavLM, l8)
WaveFit (WavLM, l24)
Proposed WaveTrainerFit (WavLM, l2)
Proposed WaveTrainerFit (WavLM, l8)
Proposed WaveTrainerFit (WavLM, l24)


260_123288_000004_000001: "The atmosphere is charged with vapours, pervaded with the electricity generated by the evaporation of saline waters."
clean
WaveFit (WavLM, l2)
WaveFit (WavLM, l8)
WaveFit (WavLM, l24)
Proposed WaveTrainerFit (WavLM, l2)
Proposed WaveTrainerFit (WavLM, l8)
Proposed WaveTrainerFit (WavLM, l24)

It can be heard that the proposed method works robustly in terms of naturalness even when generating from features that lack acoustic information.



References

[1] Y. Koizumi, H. Zen, S. Karita et al., “Miipher: A robust speech restoration model integrating self-supervised speech and text representations,” in Proc. of IEEE WASPAA, 2023, pp. 1–5.
[2] T. Saeki, G. Wang, N. Morioka et al., “Extending multilingual speech synthesis to 100+ languages without transcribed data,” in Proc. of IEEE ICASSP, 2024, pp. 11546–11550.
[3] Y. Koizumi, K. Yatabe, H. Zen et al., “WaveFit: An iterative and non-autoregressive neural vocoder based on fixed-point iteration,” in Proc. of IEEE SLT, 2022, pp. 884–891.
[4] Y. Koizumi, H. Zen, K. Yatabe et al., “SpecGrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping,” in Proc. of Interspeech, 2022, pp. 803–807.
[5] C. H. Lee, C. Yang, J. Cho et al., “RestoreGrad: Signal restoration using conditional denoising diffusion models with jointly learned prior,” in Proc. of ICML, 2025.
[6] Y. Koizumi, H. Zen, S. Karita et al., “LibriTTS-R: A restored multi-speaker text-to-speech corpus,” in Proc. of Interspeech, 2023, pp. 5496–5500.
[7] S. Chen, C. Wang, Z. Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1505–1518, 2022.
[8] A. Babu, C. Wang, A. Tjandra et al., “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. of Interspeech, 2022, pp. 2278–2282.
[9] A. Radford, J. W. Kim, T. Xu et al., “Robust speech recognition via large-scale weak supervision,” in Proc. of ICML, vol. 202, 2023, pp. 28492–28518.