BASE TTS

February 14, 2024

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities.

BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner.

Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding.

Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences.

Search This Blog

chatainews

BASE TTS

Comments

Post a Comment

Popular posts from this blog

When their AI chums have Bob's data

Hamza Chaudhry

Supporting Artistes (SAs)