Amazon Demos the Largest Text-to-Speech AI Model, Big Adaptive Streamable TTS with Emergent Abilities

This model sets a new benchmark for speech synthesis.

Amazon has shared BASE TTS, a text-to-speech model trained on 100,000 hours of public domain speech data. The data is mainly in English but also includes German, Dutch, and Spanish. The researchers report that the model sets a new standard for natural-sounding speech.

The model uses a 1-billion-parameter Transformer and a convolution-based decoder for efficient text-to-speech conversion. It introduces a new approach to analysing speech that distinguishes between different voices. It also employs a technique called byte-pair encoding to reduce the size of the speech data, which enhances the model’s efficiency and speed in processing and generating speech.
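To illustrate the compression idea, here is a minimal sketch of generic byte-pair encoding applied to a sequence of integer codes. It is not Amazon's implementation: the function names, the toy sequence, and the stopping rule (stop when no pair occurs at least twice) are all assumptions made for illustration. Each merge replaces the most frequent adjacent pair with one new token, so the sequence gets shorter while remaining fully recoverable.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return (pair, count) for the most common adjacent pair, or (None, 0)."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None, 0
    return pairs.most_common(1)[0]


def merge_pair(seq, pair, new_token):
    """Replace each non-overlapping occurrence of pair with new_token, left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges):
    """Apply up to num_merges merges; return the shorter sequence and the merge table."""
    seq = list(seq)
    merges = {}
    next_token = max(seq, default=-1) + 1  # new tokens get ids above the existing ones
    for _ in range(num_merges):
        pair, count = most_frequent_pair(seq)
        if count < 2:  # assumed stopping rule: nothing left worth merging
            break
        merges[pair] = next_token
        seq = merge_pair(seq, pair, next_token)
        next_token += 1
    return seq, merges


def bpe_expand(seq, merges):
    """Undo the merges to recover the original sequence."""
    inverse = {token: pair for pair, token in merges.items()}
    out = []
    stack = list(reversed(seq))
    while stack:
        token = stack.pop()
        if token in inverse:
            first, second = inverse[token]
            stack.append(second)
            stack.append(first)
        else:
            out.append(token)
    return out


# Hypothetical toy sequence standing in for discrete speech codes.
codes = [7, 3, 7, 3, 7, 3, 5, 5, 7, 3]
compressed, merges = bpe_compress(codes, num_merges=2)
print(compressed)  # [9, 8, 5, 5, 8]
```

In this toy run, ten codes shrink to five after two merges, and `bpe_expand(compressed, merges)` returns the original list. A shorter sequence means fewer steps for a model that generates one token at a time, which is the efficiency gain the article describes.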

BASE TTS shows new, or ‘emergent’, capabilities as it is trained on more data. With over 10,000 hours of training data, it understands text better, allowing it to produce speech that sounds right for the context. The model can also handle complex language features such as compound nouns and emotional expressions, showing its versatility.

One example provided in the paper is the sentence: ‘In the classroom, filled with the chatter of students sharing their holiday stories and the rustling of new textbooks, Mrs. Thompson, excited to embark on a new academic year, prepared a lesson that would challenge and inspire her students.’

BASE TTS was developed from the idea that text-to-speech systems would improve with scale. It not only produces high-quality speech but also shows new skills, such as pronouncing difficult texts correctly and using the right emotional tone. According to the paper, it performs better than other large text-to-speech systems, making it a leading model.

In another example, the audio changes tone and drops to a whisper for the sentence: ‘A profound sense of realisation washed over Matty as he whispered, “You’ve been there for me all along, haven’t you? I never truly appreciated you until now.”’

BASE TTS could improve user experiences and support low-resource languages. It can mimic speaker characteristics with little reference audio, offering new ways to create synthetic voices for people who cannot speak. Amazon decided not to release BASE TTS openly to avoid misuse, highlighting the ethical considerations involved in using advanced AI.

These capabilities, which eluded speech models until now, seem possible, as BASE TTS demonstrates. The research team also highlights the importance of diverse speech data in representing different languages, ethnicities, dialects, and genders. They call for more research on how data affects the model and on ways to make voice technology more inclusive.

A similar model is MetaVoice, an open-source, 1.2B-parameter foundation model for TTS.


K L Krithika

K L Krithika is a tech journalist at AIM. Apart from writing tech news, she enjoys reading sci-fi and pondering impossible technologies, while trying not to confuse them with reality.