Meta-StyleSpeech: Multi-Style Adaptive Text to Speech Generation

Dongchan Min, DongBok Lee, Eunho Yang, Sung Ju Hwang

Abstract
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech from only a few audio samples of the given speaker, each of which is also short in length. However, existing methods either require fine-tuning the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN), which aligns the gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes and by performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice from a single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
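To make the SALN idea in the abstract concrete, here is a minimal sketch in plain Python. It assumes that the hidden text feature is first layer-normalized and that the gain and bias are each predicted from the style vector by a single affine projection. The function names (`layer_norm`, `affine`, `saln`), the vector sizes, and the projection parameters are illustrative assumptions, not the authors' implementation.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def affine(weights, bias, w):
    """Affine map of the style vector w: one output per row of `weights`."""
    return [sum(wi * xi for wi, xi in zip(row, w)) + b
            for row, b in zip(weights, bias)]

def saln(h, style, gain_proj, bias_proj):
    """Style-adaptive layer norm (sketch): gain and bias depend on `style`.

    gain_proj and bias_proj are hypothetical (weights, bias) pairs that map
    the style vector to one gain and one bias per feature of h.
    """
    y = layer_norm(h)
    gain = affine(gain_proj[0], gain_proj[1], style)
    bias = affine(bias_proj[0], bias_proj[1], style)
    return [g * yi + b for g, yi, b in zip(gain, y, bias)]

if __name__ == "__main__":
    h = [1.0, 2.0, 3.0, 4.0]       # hidden text feature (assumed size 4)
    style = [0.5, -0.5]            # style vector (assumed size 2)
    gain_proj = ([[0.2, 0.0]] * 4, [1.0] * 4)
    bias_proj = ([[0.0, 0.3]] * 4, [0.0] * 4)
    print(saln(h, style, gain_proj, bias_proj))
```

Unlike ordinary layer normalization, where the gain and bias are fixed learned parameters, the two projections here make them functions of the reference style, so the same text feature is rescaled differently for each speaker.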

[paper] [code]

Contents

(Section 4.2) Evaluation on Trained Speakers
(Section 4.3) Unseen Speakers Adaptation
1. Varying Lengths of Reference Audios
2. Gender and Accent
(Section 4.4) Ablation Studies

(Section 4.2) Evaluation on Trained Speakers

LibriTTS : 'if i presume to begin,' said blenkiron, 'it's because i reckon my story is the shortest.'

Reference Audio Model

GT

GT mel + MelGAN


DeepVoice3

GMVAE

Multi-speaker FS2(vanilla)

Multi-speaker FS2+d-vector

StyleSpeech

Meta-StyleSpeech

LibriTTS : the standard of measurement these days is the ability to serve.

Reference Audio Model

GT

GT mel + MelGAN


DeepVoice3

GMVAE

Multi-speaker FS2(vanilla)

Multi-speaker FS2+d-vector

StyleSpeech

Meta-StyleSpeech

LibriTTS : after years of watching the processes of nature, he says, i can no more doubt the existence of an intelligence that is running things than i do of the existence of myself.

Reference Audio Model

GT

GT mel + MelGAN


DeepVoice3

GMVAE

Multi-speaker FS2(vanilla)

Multi-speaker FS2+d-vector

StyleSpeech

Meta-StyleSpeech

LibriTTS : now this child was too old to be nursed, as everybody told her; for he could run, say two yards alone, and perhaps four or five, by holding to handles.

Reference Audio Model

GT

GT mel + MelGAN


DeepVoice3

GMVAE

Multi-speaker FS2(vanilla)

Multi-speaker FS2+d-vector

StyleSpeech

Meta-StyleSpeech

(Section 4.3) Unseen Speakers Adaptation

1. Varying Lengths of Reference Audios

VCTK : i have the first six months of next season to prove myself.

Length of reference audio | GT | Meta-StyleSpeech | StyleSpeech | Multi-speaker FS2+d-vector | Multi-speaker FS2(vanilla) | GMVAE

<1sec

1~3sec

1 sentence

2 sentences

VCTK : he's an excellent defender, strong and quick.

Length of reference audio | GT | Meta-StyleSpeech | StyleSpeech | Multi-speaker FS2+d-vector | Multi-speaker FS2(vanilla) | GMVAE

<1sec

1~3sec

1 sentence

2 sentences

VCTK : he was a crazy man.

Length of reference audio | GT | Meta-StyleSpeech | StyleSpeech | Multi-speaker FS2+d-vector | Multi-speaker FS2(vanilla) | GMVAE

<1sec

1~3sec

1 sentence

2 sentences

VCTK : i could live without the attention, he admitted.

Length of reference audio | GT | Meta-StyleSpeech | StyleSpeech | Multi-speaker FS2+d-vector | Multi-speaker FS2(vanilla) | GMVAE

<1sec

1~3sec

1 sentence

2 sentences

2. Gender and Accent

VCTK

Gender GT Meta-StyleSpeech

Male

Female

VCTK

Accent GT Meta-StyleSpeech

American

British

Indian

African

Australian

(Section 4.4) Ablation Studies

LibriTTS(seen) : after years of watching the processes of nature, he says, i can no more doubt the existence of an intelligence that is running things than i do of the existence of myself.

Meta-StyleSpeech w/o $D_t$ w/o $D_s$ w/o $L_{cls}$

VCTK(unseen) : however, they continued in their pursuit of victory.

Meta-StyleSpeech w/o $D_t$ w/o $D_s$ w/o $L_{cls}$