
Microsoft Rolls Out VALL-E 2, Attains Human-Level Speech Synthesis

The model introduces two new techniques, Repetition Aware Sampling and Grouped Code Modeling, to improve the stability and efficiency of the speech synthesis process.


Building on the success of VALL-E, Microsoft has introduced VALL-E 2, a neural codec language model designed to achieve human-level performance in zero-shot text-to-speech (TTS) synthesis. 


Let’s take a look at the new methods. 

  1. Repetition Aware Sampling: This method refines traditional nucleus sampling by taking token repetition in the decoding history into account, improving stability and preventing the infinite-loop issues encountered in the earlier model (see the sketch after this list).
  2. Grouped Code Modeling: This technique organises codec codes into groups to reduce the sequence length, thereby speeding up inference and easing the challenges of long-sequence modelling (also sketched below).
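
To make the first idea concrete, here is a minimal NumPy sketch of repetition aware sampling: draw a token by nucleus sampling, and if that token already dominates the recent decoding history, resample from the full distribution to break the loop. The window, threshold and top-p values below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, top_p: float, rng: np.random.Generator) -> int:
    """Standard nucleus (top-p) sampling: sample from the smallest set of
    tokens whose cumulative probability reaches top_p."""
    order = np.argsort(probs)[::-1]               # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalise over the nucleus
    return int(rng.choice(kept, p=kept_probs))

def repetition_aware_sample(probs, history, window=10, threshold=0.5,
                            top_p=0.9, rng=None):
    """Nucleus-sample a token; if it fills more than `threshold` of the last
    `window` decoded tokens, fall back to sampling the full distribution."""
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.asarray(probs, dtype=float)
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > threshold:
        # The nucleus keeps emitting the same code: break the loop by
        # drawing from the full, unfiltered distribution instead.
        token = int(rng.choice(len(probs), p=probs))
    return token
```

In a decoding loop, `history` would simply be the list of codec codes emitted so far, appended to after each call.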
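
The second idea is largely a bookkeeping change: instead of predicting one codec code per autoregressive step, the model predicts a whole group of codes at once, so the modelled sequence shrinks by the group size. A toy sketch of the regrouping, with `PAD_ID` as a hypothetical padding code for the final partial group:

```python
from typing import List

PAD_ID = 0  # hypothetical padding code for the last, possibly partial group

def group_codes(codes: List[int], group_size: int) -> List[List[int]]:
    """Partition a flat codec-code sequence into fixed-size groups so an
    autoregressive model predicts one group per step instead of one code;
    a sequence of length T then takes roughly T / group_size steps."""
    padding = (-len(codes)) % group_size
    padded = codes + [PAD_ID] * padding
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]

# group_codes([7, 3, 9, 1, 4], group_size=2) -> [[7, 3], [9, 1], [4, 0]]
```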

These innovations enable VALL-E 2 to synthesise speech with high accuracy and naturalness, even for complex sentences. The model requires only simple speech-transcription pair data for training, which simplifies data collection and processing.

The model has been evaluated on the LibriSpeech and VCTK datasets, demonstrating superior performance in speech robustness, naturalness, and speaker similarity compared to previous systems. It is the first model to achieve human parity on these benchmarks, producing high-quality speech for complex and repetitive sentences.

Read the full paper here. 

What Makes VALL-E 2 Better

In January 2023, the company introduced VALL-E, which demonstrated in-context learning capabilities in zero-shot scenarios after being pre-trained on 60,000 hours of English speech data.

However, it faced issues with stability and efficiency. VALL-E relied on random sampling, which could lead to unstable outputs, and its autoregressive architecture resulted in slow inference speeds. 

Follow-up works have tried to address these problems by leveraging text-speech alignment information and non-autoregressive methods, but these approaches introduced new complexities and limitations.

The capabilities of VALL-E 2 can be particularly beneficial for generating speech for individuals with speech impairments, such as those with aphasia or amyotrophic lateral sclerosis. 

While the new model has significant potential, it also carries risks of misuse, such as voice spoofing or impersonation. The model assumes user consent for voice synthesis. In real-world applications, it should include protocols for speaker approval and detection of synthesised speech to prevent abuse.
