Introduction

The Zeroth-TTS service provides an API for speech synthesis, or text-to-speech (TTS).

It currently supports Korean-language input with multiple synthesized voices. New voices can be added if training data is supplied.

The TTS engine consists of two components:

  • the encoder is the primary service, which executes the TTS model

  • the workers manage connections between the encoder and the Zeroth master, and also perform some post-processing
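
The split between the two components can be pictured with a minimal sketch. Everything in it is hypothetical: `fake_encoder`, `Worker`, and `peak_normalize` are illustrative names, not part of the Zeroth-TTS API, and peak normalization is only an assumed example of the kind of post-processing a worker might do.

```python
from typing import Callable, List


def fake_encoder(text: str) -> List[float]:
    """Hypothetical stand-in for the encoder service.

    The real encoder runs the TTS model; this one just returns
    four dummy audio samples per input character.
    """
    return [0.5 * ((i % 4) - 1.5) for i in range(len(text) * 4)]


def peak_normalize(samples: List[float], peak: float = 0.9) -> List[float]:
    """Assumed post-processing step: scale audio so its loudest sample hits `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    scale = peak / loudest
    return [s * scale for s in samples]


class Worker:
    """Hypothetical worker: relays a request to the encoder, then post-processes."""

    def __init__(self, encoder: Callable[[str], List[float]]) -> None:
        self.encoder = encoder

    def handle(self, text: str) -> List[float]:
        raw_audio = self.encoder(text)    # encoder executes the TTS model
        return peak_normalize(raw_audio)  # worker applies post-processing


if __name__ == "__main__":
    worker = Worker(fake_encoder)
    audio = worker.handle("안녕하세요")
    print(len(audio), max(abs(s) for s in audio))
```

In this sketch the worker only depends on a callable, so the encoder could equally be a remote service behind a network client.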

The TTS model is powered by deep learning and consists of two elements:

  • the text-to-audio system converts text input into an intermediate representation of the audio, a mel spectrogram. We use a modified Tacotron 2 for this.

  • the vocoder converts the mel spectrogram into an audio signal. We support a variety of GAN models, primarily Multi-band MelGAN and Parallel WaveGAN.
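
The two-stage flow can be sketched as below. This is a shape-level illustration only: `text_to_mel` and `vocode` are placeholder functions that stand in for the real models, and the constants (80 mel bands, a hop of 256 samples, a 22050 Hz sample rate, 5 frames per character) are assumptions chosen for the example, not documented Zeroth-TTS settings.

```python
import math
from typing import List

N_MELS = 80          # assumed number of mel bands per frame
HOP_LENGTH = 256     # assumed number of audio samples per mel frame
SAMPLE_RATE = 22050  # assumed output sample rate in Hz

Mel = List[List[float]]  # mel spectrogram as frames x mel bands


def text_to_mel(text: str, frames_per_char: int = 5) -> Mel:
    """Placeholder for the text-to-audio model (e.g. a Tacotron 2 variant).

    Returns an all-zero mel spectrogram whose length grows with the text.
    """
    n_frames = len(text) * frames_per_char
    return [[0.0] * N_MELS for _ in range(n_frames)]


def vocode(mel: Mel) -> List[float]:
    """Placeholder for the vocoder: expands each mel frame into HOP_LENGTH samples.

    A real GAN vocoder predicts the waveform from the mel content;
    here a 220 Hz sine tone is emitted instead.
    """
    n_samples = len(mel) * HOP_LENGTH
    return [math.sin(2 * math.pi * 220 * i / SAMPLE_RATE) for i in range(n_samples)]


def synthesize(text: str) -> List[float]:
    """Run both stages: text -> mel spectrogram -> audio samples."""
    mel = text_to_mel(text)
    return vocode(mel)


if __name__ == "__main__":
    audio = synthesize("안녕하세요")
    print(f"{len(audio)} samples, {len(audio) / SAMPLE_RATE:.2f} s")
```

Keeping the mel spectrogram as the only interface between the stages is what allows one vocoder to be swapped for another without touching the text-to-audio model.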
