Amazon has introduced BASE TTS, a text-to-speech model trained on 100,000 hours of public domain speech data, mainly in English but also including German, Dutch, and Spanish. The researchers present it as setting a new standard for natural-sounding speech.
The model pairs a 1-billion-parameter Transformer with a convolution-based decoder for efficient text-to-speech conversion. It introduces a new approach to analysing speech that distinguishes between different voices. It also uses a technique called byte-pair encoding to reduce the size of the speech data, which enhances the model’s efficiency and speed in processing and generating speech.
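To give a rough sense of how byte-pair encoding shortens a sequence of discrete codes, here is a minimal Python sketch. It is an illustration only, not Amazon's implementation: the function names, the toy integer "speech codes", and the choice of 100 as the first new token ID are all hypothetical. The idea is to repeatedly replace the most frequent adjacent pair of tokens with a single new token, while keeping a merge table so the original sequence can be recovered.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if there is none."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(seq, pair, new_token):
    """Replace each non-overlapping occurrence of pair (left to right) with new_token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_token):
    """Apply up to num_merges greedy merges; return (shorter sequence, merge table)."""
    merges = {}
    seq = list(seq)
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_token = first_new_token + step
        merges[new_token] = pair
        seq = merge_pair(seq, pair, new_token)
    return seq, merges


def bpe_expand(seq, merges):
    """Undo the merges to recover the original token sequence."""
    out = []
    stack = list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in merges:
            left, right = merges[tok]
            stack.append(right)  # pushed first so that left is popped first
            stack.append(left)
        else:
            out.append(tok)
    return out


# Hypothetical toy data: integers standing in for discrete speech codes.
codes = [7, 3, 7, 3, 5, 7, 3, 5, 2]
compressed, merges = bpe_compress(codes, num_merges=2, first_new_token=100)
print(len(codes), "->", len(compressed))
```

In this toy run, two merges shrink nine codes to four, and `bpe_expand(compressed, merges)` returns the original list. A shorter sequence means fewer steps for a model that generates one token at a time, which is the efficiency gain the article refers to.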
BASE TTS shows new, or ‘emergent’, capabilities as it is trained on more data. Once trained on more than 10,000 hours of speech, it understands text better, allowing it to produce speech that sounds right for the context. The model can also handle complex language features such as compound nouns and emotional expressions, showing its versatility.
The paper gives this example: ‘In the classroom, filled with the chatter of students sharing their holiday stories and the rustling of new textbooks, Mrs. Thompson, excited to embark on a new academic year, prepared a lesson that would challenge and inspire her students.’
BASE TTS grew out of the idea that text-to-speech systems would get better with scale. Beyond producing high-quality speech, it shows new skills, such as pronouncing difficult texts correctly and using the right emotional tone. It performs better than other large text-to-speech systems, making it a leading model.
In another example, the audio changes tone and drops to a whisper for the sentence: ‘A profound sense of realisation washed over Matty as he whispered, “You’ve been there for me all along, haven’t you? I never truly appreciated you until now.”’
BASE TTS could improve user experiences and help languages with few resources. It can mimic speaker characteristics with little reference audio, offering new ways to create synthetic voices for people who cannot speak. Amazon decided not to share BASE TTS openly to avoid misuse, highlighting ethical considerations in using advanced AI.
These capabilities, which eluded speech models until now, seem possible, as BASE TTS demonstrates. The research team also highlights the importance of diverse speech data in representing different languages, ethnicities, dialects, and genders. They call for more research on how data affects the model and on ways to make voice technology more inclusive.
A similar model is MetaVoice, an open-source, 1.2-billion-parameter foundational model for TTS.