Introduction
The Zeroth-TTS service provides an API for speech synthesis, or text-to-speech (TTS).
Currently, it supports Korean language input with multiple synthesized voices. If training data is supplied, new voices can be supported.
The TTS engine consists of two components:
The encoder is the primary service that executes the TTS model.
The workers manage connections between the encoder and the Zeroth master and perform some post-processing.
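The division of labor between the two components can be sketched roughly as follows. This is an illustration only, not the service's actual code: the names `Worker`, `fake_encoder`, and `peak_normalize` are hypothetical, and peak normalization is just an assumed example of post-processing, since the kind of post-processing the workers perform is not specified here.

```python
from typing import Callable, List

# Hypothetical type: an encoder maps text to audio samples in [-1.0, 1.0].
Encoder = Callable[[str], List[float]]


def fake_encoder(text: str) -> List[float]:
    """Stand-in for the encoder service: one quiet dummy sample per character."""
    return [0.25 * ((i % 3) - 1) for i, _ in enumerate(text)]


def peak_normalize(samples: List[float], peak: float = 0.9) -> List[float]:
    """Assumed post-processing step: scale so the loudest sample reaches `peak`."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # silence (or empty input) is returned unchanged
    return [s * peak / loudest for s in samples]


class Worker:
    """Relays a synthesis request to the encoder and post-processes the result."""

    def __init__(self, encoder: Encoder) -> None:
        self.encoder = encoder

    def handle(self, text: str) -> List[float]:
        raw = self.encoder(text)      # forward the request to the encoder
        return peak_normalize(raw)    # post-process before replying to the master


audio = Worker(fake_encoder).handle("안녕하세요")
```

Keeping the encoder behind a plain callable means the worker does not need to know how the model runs, only that it returns audio for a given text.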
The TTS model is powered by deep learning and consists of two elements:
The text-to-audio system converts text input into an intermediate representation of the audio, a mel spectrogram. We use a modified Tacotron 2 for this.
The vocoder converts the mel spectrogram into an audio signal. We support a variety of GAN-based models, primarily Multi-band MelGAN and Parallel WaveGAN.
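The two-stage flow described above can be sketched as a simple function composition. This is a minimal sketch with placeholder stand-ins, not the real models: `text_to_mel`, `vocode`, and `synthesize` are hypothetical names, and the values of `N_MELS` and `HOP_LENGTH` are assumed for illustration. A real text-to-audio model also produces a variable number of frames per character rather than exactly one.

```python
from typing import List

# A mel spectrogram as a list of frames, each frame holding N_MELS values.
Mel = List[List[float]]

N_MELS = 80        # assumed number of mel bins (illustrative value)
HOP_LENGTH = 256   # assumed number of audio samples per mel frame (illustrative value)


def text_to_mel(text: str) -> Mel:
    """Stand-in for the text-to-audio model: one all-zero frame per character."""
    return [[0.0] * N_MELS for _ in text]


def vocode(mel: Mel) -> List[float]:
    """Stand-in for the vocoder: HOP_LENGTH silent samples per mel frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> List[float]:
    """Run both stages: text -> mel spectrogram -> audio samples."""
    mel = text_to_mel(text)
    return vocode(mel)


waveform = synthesize("안녕하세요")
```

Because the mel spectrogram is the only interface between the two stages, the vocoder can in principle be swapped (for example, between the GAN models mentioned above) without changing the text-to-audio stage.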