Building on the success of VALL-E, Microsoft has introduced VALL-E 2, a neural codec language model designed to achieve human-level performance in zero-shot text-to-speech (TTS) synthesis.
The model introduces two new techniques, Repetition Aware Sampling and Grouped Code Modeling, to improve the stability and efficiency of speech synthesis.
Let’s take a look at the new methods.
- Repetition Aware Sampling: This method refines traditional nucleus sampling by taking token repetition in the decoding history into account, improving stability and preventing the infinite-loop issues encountered in earlier models (a sketch of the sampling step follows this list).
- Grouped Code Modeling: This technique organises codec codes into groups to reduce the sequence length, thereby speeding up inference and addressing the challenges of long-sequence modelling (see the second sketch below).
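To make the decoding step concrete, here is a minimal Python sketch of repetition-aware sampling. It follows the flow the paper describes: draw a token by nucleus sampling, then, if that token has repeated too often in the recent history, fall back to sampling from the full distribution. The function names, window size, and threshold are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def nucleus_sample(probs, top_p, rng):
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    kept = order[:np.searchsorted(cum, top_p) + 1]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            repeat_threshold=0.3, rng=None):
    # `probs` is the model's distribution over codec tokens at this step;
    # `history` is the list of tokens decoded so far. Hyperparameter values
    # here are illustrative, not the paper's exact settings.
    rng = rng or np.random.default_rng()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = sum(t == token for t in recent) / len(recent)
        if ratio > repeat_threshold:
            # The drawn token is looping: escape by sampling from the
            # full distribution instead of the nucleus.
            token = int(rng.choice(len(probs), p=probs))
    return token
```

At each autoregressive step the decoder would call this in place of plain nucleus sampling, which is how the model can avoid the infinite-loop failure mode of VALL-E without abandoning sampling altogether.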
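And a minimal sketch of the grouping idea: the codec code sequence is partitioned into fixed-size groups, so the autoregressive model handles one group per step rather than one code per step. The function name and pad token are assumptions for illustration.

```python
import numpy as np

def group_codes(codes, group_size, pad_id=-1):
    # Partition a 1-D sequence of codec codes into consecutive groups,
    # so a sequence of T codes becomes roughly T / group_size positions.
    # pad_id is a placeholder; the real pad token depends on the codec.
    remainder = len(codes) % group_size
    if remainder:
        pad = np.full(group_size - remainder, pad_id, dtype=codes.dtype)
        codes = np.concatenate([codes, pad])
    return codes.reshape(-1, group_size)

codes = np.arange(7)           # 7 codec codes
print(group_codes(codes, 2))   # modelled as 4 positions instead of 7
# [[ 0  1]
#  [ 2  3]
#  [ 4  5]
#  [ 6 -1]]
```

With group size G, the model predicts all the codes in a group at a single step, cutting the number of autoregressive steps by a factor of roughly G, which is where the inference speed-up comes from.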
These innovations enable VALL-E 2 to synthesise speech with high accuracy and naturalness, even for complex sentences. The model requires only simple speech-transcription pairs for training, simplifying data collection and processing.
The model has been evaluated on the LibriSpeech and VCTK datasets, demonstrating superior performance in speech robustness, naturalness, and speaker similarity compared to previous systems. It is the first model to achieve human parity on these benchmarks, producing high-quality speech for complex and repetitive sentences.
Read the full paper here.
What Makes VALL-E 2 Better
In January 2023, Microsoft introduced VALL-E, which demonstrated in-context learning capabilities in zero-shot scenarios after being pre-trained on 60,000 hours of English speech data.
However, it faced issues with stability and efficiency. VALL-E relied on random sampling, which could lead to unstable outputs, and its autoregressive architecture resulted in slow inference speeds.
Follow-up works have tried to address these problems by leveraging text-speech alignment information and non-autoregressive methods, but these approaches introduced new complexities and limitations.
The capabilities of VALL-E 2 can be particularly beneficial for generating speech for individuals with speech impairments, such as those with aphasia or amyotrophic lateral sclerosis.
While the new model has significant potential, it also carries risks of misuse, such as voice spoofing or impersonation. The model assumes the user has consented to voice synthesis; real-world applications should include protocols for speaker approval and for detecting synthesised speech to prevent abuse.