Meta-StyleSpeech: Multi-Style Adaptive Text to Speech Generation

Dongchan Min, DongBok Lee, Eunho Yang, Sung Ju Hwang

Abstract
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few short audio samples of the given speaker. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and by performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice from a single short (1-3 sec) speech audio sample, significantly outperforming baselines.
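The SALN idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the gain and bias are each predicted from the style vector by a single fully connected layer, and the names `layer_norm`, `affine`, and `saln`, as well as the parameter layout, are hypothetical and chosen for illustration. Pure Python is used so the example runs without dependencies.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def affine(w, weight, bias):
    """Fully connected layer; `weight` holds one row of coefficients per output unit."""
    return [sum(wi * ri for wi, ri in zip(w, row)) + b
            for row, b in zip(weight, bias)]

def saln(h, w, gain_weight, gain_bias, bias_weight, bias_bias):
    """Style-adaptive layer norm sketch: gain and bias depend on the style vector w.

    Assumption: gain and bias each come from one affine map of w.
    """
    y = layer_norm(h)                         # style-independent normalization
    gain = affine(w, gain_weight, gain_bias)  # per-channel gain from the style
    bias = affine(w, bias_weight, bias_bias)  # per-channel bias from the style
    return [g * yi + b for g, yi, b in zip(gain, y, bias)]

# Toy usage: a 4-channel hidden feature and a 2-dimensional style vector.
h = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 0.0]
gain_weight = [[2.0, 0.0]] * 4   # gain = 2 * w[0] for every channel
gain_bias = [0.0] * 4
bias_weight = [[0.0, 3.0]] * 4   # bias = 3 * w[1] + 0.5 for every channel
bias_bias = [0.5] * 4
out = saln(h, w, gain_weight, gain_bias, bias_weight, bias_bias)
print(out)
```

Because the normalization itself carries no learned parameters here, everything speaker-specific enters through `w`, which is why a different reference audio (and hence a different style vector) changes the output without retraining the layer.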

[paper] [code]

(Section 4.2) Evaluation on Trained Speakers
(Section 4.3) Unseen Speakers Adaptation
1. Varying Lengths of Reference Audios
2. Gender and Accent
(Section 4.4) Ablation Studies

(Section 4.2) Evaluation on Trained Speakers

LibriTTS : 'if i presume to begin,' said blenkiron, 'it's because i reckon my story is the shortest.'

Reference Audio	Model
	GT
	GT mel + MelGAN
	DeepVoice3
	GMVAE
	*Multi-speaker FS2(vanilla)*
	Multi-speaker FS2+d-vector
	StyleSpeech
	Meta-StyleSpeech

LibriTTS : the standard of measurement these days is the ability to serve.

Reference Audio	Model
	GT
	GT mel + MelGAN
	DeepVoice3
	GMVAE
	*Multi-speaker FS2(vanilla)*
	Multi-speaker FS2+d-vector
	StyleSpeech
	Meta-StyleSpeech

LibriTTS : after years of watching the processes of nature, he says, i can no more doubt the existence of an intelligence that is running things than i do of the existence of myself.

Reference Audio	Model
	GT
	GT mel + MelGAN
	DeepVoice3
	GMVAE
	*Multi-speaker FS2(vanilla)*
	Multi-speaker FS2+d-vector
	StyleSpeech
	Meta-StyleSpeech

LibriTTS : now this child was too old to be nursed, as everybody told her; for he could run, say two yards alone, and perhaps four or five, by holding to handles.

Reference Audio	Model
	GT
	GT mel + MelGAN
	DeepVoice3
	GMVAE
	*Multi-speaker FS2(vanilla)*
	Multi-speaker FS2+d-vector
	StyleSpeech
	Meta-StyleSpeech

(Section 4.3) Unseen Speakers Adaptation

1. Varying Lengths of Reference Audios

VCTK : i have the first six months of next season to prove myself.

Length of reference audio	GT	Meta-StyleSpeech	StyleSpeech	Multi-speaker FS2+d-vector	Multi-speaker FS2(vanilla)	GMVAE
<1 sec
1-3 sec
1 sentence
2 sentences

VCTK : he's an excellent defender, strong and quick.

Length of reference audio	GT	Meta-StyleSpeech	StyleSpeech	Multi-speaker FS2+d-vector	Multi-speaker FS2(vanilla)	GMVAE
<1 sec
1-3 sec
1 sentence
2 sentences

VCTK : he was a crazy man.

Length of reference audio	GT	Meta-StyleSpeech	StyleSpeech	Multi-speaker FS2+d-vector	Multi-speaker FS2(vanilla)	GMVAE
<1 sec
1-3 sec
1 sentence
2 sentences

VCTK : i could live without the attention, he admitted.

Length of reference audio	GT	Meta-StyleSpeech	StyleSpeech	Multi-speaker FS2+d-vector	Multi-speaker FS2(vanilla)	GMVAE
<1 sec
1-3 sec
1 sentence
2 sentences

2. Gender and Accent

VCTK

Gender	GT	Meta-StyleSpeech
Male
Male
Female
Female

VCTK

Accent	GT	Meta-StyleSpeech
American
British
Indian
African
Australian

(Section 4.4) Ablation Studies

LibriTTS(seen) : after years of watching the processes of nature, he says, i can no more doubt the existence of an intelligence that is running things than i do of the existence of myself.

Meta-StyleSpeech	w/o $D_t$	w/o $D_s$	w/o $L_{cls}$

VCTK(unseen) : however, they continued in their pursuit of victory.

Meta-StyleSpeech	w/o $D_t$	w/o $D_s$	w/o $L_{cls}$