What is TTS?

Text-to-speech (TTS), or speech generation, is the task of generating audio from text input.

In general, TTS methods are divided into concatenative and parametric methods:

Concatenative methods rely on joining small pre-recorded audio segments (e.g. phones);

Parametric methods convert text into a set of features, which are then rendered to audio with a vocoder or through algorithms such as Griffin-Lim. Previous approaches did this in a three-step process: first rendering the text to a phone sequence, then converting the phone sequence to vocoder input features (spectrograms, cepstra, fundamental frequencies, pitch information, etc.), and finally rendering those features to audio.
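
As a minimal illustration of the final feature-to-audio step, librosa's Griffin-Lim implementation can render a magnitude spectrogram back into a waveform. This is only a sketch; the file name and STFT settings are placeholders, not our pipeline's values:

```python
import numpy as np
import librosa
import soundfile as sf

# Load a reference clip (placeholder path) and compute a magnitude spectrogram,
# standing in for the features a parametric front end would predict.
y, sr = librosa.load("sample.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim iteratively estimates the phase that the magnitude spectrogram
# discarded, then inverts the STFT back to a waveform.
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)

sf.write("reconstructed.wav", y_hat, sr)
```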

Modern neural approaches build on the parametric framework by using neural networks both for feature generation (text → features, i.e. the "TTS" model) and for audio generation (the vocoder).
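
Conceptually, inference is just two model calls chained together. The sketch below uses placeholder TorchScript files and names as stand-ins for the two trained networks; it is not a specific library API:

```python
import torch

# Hypothetical handles for the two trained networks; the file names and the
# loading mechanism are placeholders.
tts_model = torch.jit.load("text_to_mel.pt").eval()   # text -> mel-spectrogram
vocoder = torch.jit.load("mel_to_audio.pt").eval()    # mel-spectrogram -> waveform

def synthesize(token_ids: torch.Tensor) -> torch.Tensor:
    """Run the two-stage neural pipeline: feature generation, then vocoding."""
    with torch.no_grad():
        mel = tts_model(token_ids.unsqueeze(0))   # (1, n_mels, frames)
        audio = vocoder(mel)                      # (1, samples)
    return audio.squeeze(0)
```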

see: https://medium.com/sciforce/text-to-speech-synthesis-an-overview-641c18fcd35f

Atlas Labs TTS

Our current TTS system consists of:

  • a pretrained speaker embedding network trained on a speaker verification task (SV2TTS)

  • a Tacotron 2-based text-to-mel TTS network

  • a WaveGlow vocoder for audio generation (with alternatives available depending on use case)

  • Zeroth-EE integration coming soon

TTS: Tacotron 2

Tacotron 2 is a model that converts text (or phone-sequence) inputs into features that a vocoder can interpret to generate high-quality audio. The Tacotron 2 authors use the mel-spectrogram as the feature representation: a spectrogram that has been scaled to emphasize the frequencies most important to intelligible speech.
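
For reference, a mel-spectrogram can be extracted with librosa using parameters commonly found in open-source Tacotron 2 implementations; these values are assumptions and may differ from our production settings:

```python
import numpy as np
import librosa

# Parameters commonly used by open-source Tacotron 2 implementations:
# 22.05 kHz audio, 1024-point FFT, 256-sample hop, 80 mel bands up to 8 kHz.
y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, win_length=1024,
    n_mels=80, fmin=0.0, fmax=8000.0, power=1.0,
)

# Log compression mirrors how decoder targets are usually prepared.
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
print(log_mel.shape)  # (80, n_frames)
```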

Tacotron 2 is a sequence-to-sequence neural network with attention, built mostly from recurrent cells (LSTMs) with some convolutional layers for postprocessing. It simplifies the previous Tacotron model's complex CBHG layers.

The Tacotron 2 architecture

We use a modified Tacotron 2 architecture with the following adaptations:

  • Korean-specific input processing that leverages the unique properties of Korean written text (thanks to King Sejong's brilliance)

  • Monotonic Step-wise Attention for better generation of long outputs (less repeating/skipping)

  • We replace the learnable speaker embeddings with pretrained speaker verification embeddings, which allow a single model to produce multiple voices, speed up multispeaker training, and improve generalization to new speakers through transfer learning (see the conditioning sketch below and the Multispeaker SV2TTS Embeddings section).
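
As a rough illustration of that last point, a fixed speaker vector can be broadcast over time and concatenated onto the encoder outputs so the attention/decoder stack sees it at every step. The dimensions below are illustrative, not our exact configuration:

```python
import torch

def condition_on_speaker(encoder_outputs: torch.Tensor,
                         speaker_embedding: torch.Tensor) -> torch.Tensor:
    """Broadcast a fixed speaker vector across time and concatenate it onto
    the encoder outputs.

    encoder_outputs:   (batch, time, enc_dim)
    speaker_embedding: (batch, spk_dim) -- e.g. a pretrained SV embedding
    """
    expanded = speaker_embedding.unsqueeze(1).expand(
        -1, encoder_outputs.size(1), -1)          # (batch, time, spk_dim)
    return torch.cat([encoder_outputs, expanded], dim=-1)

# Illustrative dimensions only.
enc = torch.randn(2, 120, 512)    # encoder outputs for 2 utterances
spk = torch.randn(2, 256)         # pretrained speaker embeddings
print(condition_on_speaker(enc, spk).shape)  # torch.Size([2, 120, 768])
```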

see: https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-speech.html

Multispeaker SV2TTS Embeddings

We leverage transfer learning to incorporate speaker information through a model pretrained on hundreds of speakers for a speaker verification task. This means that audio from the same speaker yields similar vectors that encapsulate general information about the speaker's voice. Because this model is robust to factors such as noise and recording quality, we can train it on a much larger amount of data (including non-Korean data) to create more fine-grained embeddings.
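
For illustration, here is a rough sketch of how such embeddings behave, using the open-source resemblyzer package (a reimplementation of the SV2TTS speaker encoder). The package choice and the file paths are assumptions; our production encoder may differ:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Two clips from the same speaker and one from a different speaker (placeholder paths).
emb_a1 = encoder.embed_utterance(preprocess_wav("speaker_a_clip1.wav"))
emb_a2 = encoder.embed_utterance(preprocess_wav("speaker_a_clip2.wav"))
emb_b = encoder.embed_utterance(preprocess_wav("speaker_b_clip1.wav"))

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Same-speaker similarity should be noticeably higher than cross-speaker similarity.
print("a1 vs a2:", cosine(emb_a1, emb_a2))
print("a1 vs b :", cosine(emb_a1, emb_b))
```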

For example, when we project speakers' embeddings with t-SNE and color them by individual, we can see that all of a given speaker's audio clips are grouped together. If we instead label the speakers by gender, we can see that the embedding space also captures information that differentiates gender.

t-SNE clustering, by individual
t-SNE clustering, by gender
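
A sketch of how such a plot can be produced with scikit-learn's t-SNE; the embeddings and labels here are random placeholders standing in for real utterance embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data: (n_clips, 256) speaker vectors and one speaker label per clip.
embeddings = np.random.randn(200, 256)
speaker_ids = np.repeat(np.arange(10), 20)

# Project the high-dimensional embeddings to 2-D for visualization.
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=speaker_ids, cmap="tab10", s=10)
plt.title("t-SNE of speaker embeddings (colored by speaker)")
plt.savefig("tsne_by_speaker.png", dpi=150)
```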

"Voice Cloning"

By using these embeddings as inputs to our Tacotron 2 TTS model, Tacotron also learns to condition its output on them. As a result, we can use transfer learning to fine-tune a general Tacotron 2 model to a new speaker's voice with less data. In theory, given a general model trained on enough different speakers, we could even generate voices for unseen speakers from only a few samples, just by feeding their speaker vector to the general model.
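
A sketch of that zero-shot scenario, again using resemblyzer as a stand-in encoder: embed a handful of clips from the unseen speaker, average them into a single voice vector, and pass that vector to the TTS stack. The `synthesize` call at the end is hypothetical, standing in for our Tacotron 2 + vocoder pipeline:

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# A handful of short clips from a speaker the TTS model has never seen (placeholder paths).
clips = ["new_speaker_01.wav", "new_speaker_02.wav", "new_speaker_03.wav"]
embeddings = [encoder.embed_utterance(preprocess_wav(path)) for path in clips]

# Averaging the per-utterance embeddings gives a more stable voice vector.
voice_vector = np.mean(embeddings, axis=0)
voice_vector /= np.linalg.norm(voice_vector)

# Hypothetical wrapper around the TTS + vocoder stack that accepts a speaker
# vector as conditioning input:
# audio = synthesize("안녕하세요, 만나서 반갑습니다.", speaker_embedding=voice_vector)
```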

Vocoder: WaveGlow

The vocoder is the module that converts the TTS output features (in our case, the mel-spectrogram) back into an audio waveform. Modern neural vocoders generally produce audio of much higher quality than traditional algorithmic methods such as mel-spectrogram pseudoinverse + Griffin-Lim.
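
For reference, the traditional baseline that a neural vocoder replaces can be sketched with librosa (placeholder audio path; the vocoder call at the end is shown only as a hypothetical comment):

```python
import librosa

# Build a mel-spectrogram as a stand-in for TTS output, then invert it with the
# traditional baseline: mel pseudoinverse followed by Griffin-Lim.
y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
baseline = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                                hop_length=256, n_iter=60)

# A neural vocoder such as WaveGlow would replace this step:
# audio = vocoder.infer(mel_tensor)   # hypothetical call; API varies by vocoder
```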

WaveGlow generates high-quality audio faster than real time and generalizes well to new speakers (completely new speakers work well, and transfer learning improves quality further). In both our internal tests and outside evaluations, it has proven to be one of the highest-quality vocoders capable of generating audio in near real time.
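
One way to check the "faster than real time" claim is to measure the real-time factor directly. A small helper, assuming only a generic vocoder callable (hypothetical interface) and the 22.05 kHz / 256-sample-hop settings used above:

```python
import time
import torch

def real_time_factor(vocoder, mel: torch.Tensor,
                     sr: int = 22050, hop_length: int = 256) -> float:
    """Return generated-audio duration divided by wall-clock synthesis time.

    Values above 1.0 mean faster than real time. `vocoder` is any callable
    mapping a mel-spectrogram to a waveform (hypothetical interface).
    """
    start = time.perf_counter()
    with torch.no_grad():
        vocoder(mel)
    elapsed = time.perf_counter() - start
    audio_seconds = mel.size(-1) * hop_length / sr
    return audio_seconds / elapsed
```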

We use a WaveGlow model that has been fine-tuned on our data, and we provide further fine-tuning to adapt to new clients' voices.

see: https://www.audiolabs-erlangen.de/resources/NLUI/2019-SSW-NeuralVocoders/

see: https://nv-adlr.github.io/WaveGlow (includes samples with English)

SqueezeWave, Parallel WaveGAN & MelGAN

While WaveGlow produces high-quality audio, its size makes serving it at high capacity difficult. Recently, smaller models have attempted to deliver near-WaveGlow audio quality at a fraction of the computational cost, allowing for more scalability.

SqueezeWave reduces the size of the WaveGlow model by replacing many of its more intensive computations with lighter-weight versions, trading some quality for much faster inference.
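
For intuition, one of the main substitutions described in the SqueezeWave paper is swapping ordinary 1D convolutions for depthwise separable ones. The sketch below compares parameter counts at illustrative dimensions (not the actual model sizes):

```python
import torch.nn as nn

channels, kernel = 256, 3

# Standard 1D convolution, as used in the heavier WaveGlow blocks.
standard = nn.Conv1d(channels, channels, kernel, padding=1)

# Depthwise separable replacement: per-channel convolution + 1x1 pointwise mix.
depthwise_separable = nn.Sequential(
    nn.Conv1d(channels, channels, kernel, padding=1, groups=channels),
    nn.Conv1d(channels, channels, 1),
)

def count(module):
    return sum(p.numel() for p in module.parameters())

print("standard:", count(standard))                        # ~197k parameters
print("depthwise separable:", count(depthwise_separable))  # ~67k parameters
```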

Parallel WaveGAN and MelGAN are GAN-based vocoders composed of two separate neural networks: a generator and a discriminator. The generator learns to produce realistic waveforms, while the discriminator learns to distinguish generated samples (fake) from ground-truth audio (real). Training the two networks adversarially improves generation quality, and inference runs at more than 20× real time.
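
A minimal sketch of that adversarial training loop, with hypothetical generator/discriminator modules and the plain GAN loss; the real Parallel WaveGAN and MelGAN recipes add auxiliary spectral and feature-matching losses on top of this:

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, g_opt, d_opt, mel, real_audio):
    """One adversarial update: `generator` maps mels to waveforms,
    `discriminator` scores waveforms as real or fake (hypothetical modules)."""
    # --- Discriminator update: push real clips toward 1, generated clips toward 0.
    fake_audio = generator(mel).detach()
    real_score = discriminator(real_audio)
    fake_score = discriminator(fake_audio)
    d_loss = (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
              + F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator update: fool the discriminator into scoring fakes as real.
    fake_score = discriminator(generator(mel))
    g_loss = F.binary_cross_entropy_with_logits(fake_score, torch.ones_like(fake_score))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```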
