Meta-StyleSpeech: Multi-Style Adaptive Text to Speech Generation
Dongchan Min, DongBok Lee, Eunho Yang, Sung Ju Hwang
Abstract
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications.
For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker.
However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning.
In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers.
Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single audio sample.
Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and performing episodic training.
The experimental results show that our models generate high-quality speech that accurately follows the speaker's voice from a single short (1-3 sec) reference audio, significantly outperforming the baselines.
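To make the SALN mechanism described above concrete, below is a minimal sketch assuming a PyTorch-style implementation; the module and argument names are illustrative and not the authors' code. Layer normalization is applied without fixed affine parameters, and a gain and bias predicted from the style vector modulate the normalized features.

```python
import torch
import torch.nn as nn

class SALN(nn.Module):
    """Style-Adaptive Layer Normalization (sketch).

    Standard LayerNorm statistics are used, but the affine gain and bias
    are predicted from a style vector instead of being fixed learned parameters.
    """

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # One linear layer predicts both gain and bias from the style vector.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        gain, bias = self.affine(style).unsqueeze(1).chunk(2, dim=-1)
        return gain * self.norm(x) + bias
```

Roughly speaking, a single style vector extracted from the reference audio can modulate every layer of this form in the generator, which is what lets the model adapt to a new voice without any fine-tuning.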
Audio samples on LibriTTS. For each sentence below, a reference audio is provided together with samples from GT, GT mel + MelGAN, DeepVoice3, GMVAE, Multi-speaker FS2 (vanilla), Multi-speaker FS2 + d-vector, StyleSpeech, and Meta-StyleSpeech.

LibriTTS: 'if i presume to begin,' said blenkiron, it's because i reckon my story is the shortest.
LibriTTS: the standard of measurement these days is the ability to serve.
LibriTTS: after years of watching the processes of nature, he says, i can no more doubt the existence of an intelligence that is running things than i do of the existence of myself.
LibriTTS: now this child was too old to be nursed, as everybody told her; for he could run, say two yards alone, and perhaps four or five, by holding to handles.
(Section 4.3) Unseen Speakers Adaptation
1. Varying Lengths of Reference Audios
For each sentence below, GT audio is provided together with samples from Meta-StyleSpeech, StyleSpeech, Multi-speaker FS2 + d-vector, Multi-speaker FS2 (vanilla), and GMVAE, each synthesized from reference audios of varying length: <1 sec, 1-3 sec, 1 sentence, and 2 sentences.

VCTK: i have the first six months of next season to prove myself.
VCTK: he's an excellent defender, strong and quick.
VCTK: he was a crazy man.
VCTK: i could live without the attention, he admitted.
2. Gender and Accent
Gender (VCTK): GT and Meta-StyleSpeech samples for Male and Female reference speakers.
Accent (VCTK): GT and Meta-StyleSpeech samples for American, British, Indian, African, and Australian reference accents.
(Section 4.4) Ablation Studies
Samples compare Meta-StyleSpeech with ablated variants: w/o $D_t$, w/o $D_s$, and w/o $L_{cls}$.

LibriTTS (seen): after years of watching the processes of nature, he says, i can no more doubt the existence of an intelligence that is running things than i do of the existence of myself.
VCTK (unseen): however, they continued in their pursuit of victory.
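For context on the ablated components: $D_t$ and $D_s$ are the two discriminators mentioned in the abstract, and $L_{cls}$ is the style prototype classification loss used during episodic meta-training. The following is a minimal sketch of a prototype classification loss of this kind; shapes and names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def style_prototype_cls_loss(style_vec: torch.Tensor,
                             prototypes: torch.Tensor,
                             speaker_id: torch.Tensor) -> torch.Tensor:
    """Classification loss against per-speaker style prototypes (sketch).

    style_vec:  (batch, style_dim) style vectors from the style encoder.
    prototypes: (num_speakers, style_dim) one prototype per training speaker.
    speaker_id: (batch,) index of the ground-truth speaker for each sample.
    """
    # Similarity of each style vector to every prototype, used as logits.
    logits = style_vec @ prototypes.t()
    # Cross-entropy pushes each style vector toward its own speaker's prototype.
    return F.cross_entropy(logits, speaker_id)
```

Intuitively, pulling each style vector toward its speaker's prototype and away from the others encourages the style encoder to produce discriminative styles even when a speaker is observed through only a single short reference, which is the setting the episodic training simulates.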