Google has unveiled MusicLM, a generative model that creates high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM generates music at 24 kHz that remains consistent over several minutes, casting conditional music generation as a hierarchical sequence-to-sequence modelling task.
Experiments show that MusicLM outperforms previous systems in both audio quality and adherence to the text description. MusicLM can also be conditioned on both text and a melody: it can transform whistled and hummed melodies so that they match the style described in a text caption.
Google also released MusicCaps, the first evaluation dataset collected specifically for the task of text-to-music generation: a hand-curated, high-quality set of 5.5k music-text pairs prepared by musicians.
Read the full paper here.
Key Features
MusicLM can generate music from any text description. Given the audio of a melody, it can also create new music inspired by that melody and shaped by the text prompt; in one demo it turned someone humming ‘Bella Ciao’ into an a cappella chorus. It can also generate longer pieces that progress through a sequence of prompts, and even produce music from descriptions of paintings.
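As a rough illustration of those conditioning modes, a call-level sketch might look like the following. MusicLM has no public API, so music_lm, load_audio, and the argument names here are purely hypothetical placeholders.

```python
def demo(music_lm, load_audio):
    # Hypothetical interface; MusicLM is not publicly available.

    # Text-only conditioning: a free-form description in, music out.
    calm_track = music_lm.generate(
        text="a calming violin melody backed by a distorted guitar riff")

    # Text + melody conditioning: a hummed or whistled clip fixes the melody,
    # while the text prompt fixes the style it is rendered in.
    hummed = load_audio("bella_ciao_hummed.wav")  # placeholder file name
    chorus = music_lm.generate(text="a cappella chorus", melody=hummed)

    return calm_track, chorus
```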
Training Process
Each stage is modelled as a sequence-to-sequence task leveraging decoder-only Transformers.
During training, MuLan audio tokens, semantic tokens, and acoustic tokens are extracted from the audio-only training set.
In the semantic modelling stage, semantic tokens are predicted using MuLan audio tokens as conditioning.
In the subsequent acoustic modelling stage, the model predicts acoustic tokens conditioned on both MuLan audio tokens and semantic tokens.
During inference, MuLan text tokens computed from the text prompt replace the audio tokens as the conditioning signal, and the generated acoustic tokens are converted to waveforms by the SoundStream decoder.
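Put together, the two stages and the inference-time token swap follow the rough shape of the sketch below. This is only a conceptual sketch: MusicLM’s code is not public, so every object and method name here (mulan, semantic_tokenizer, soundstream, the two stage models) is a hypothetical placeholder standing in for the components described above.

```python
def train_step(audio, mulan, semantic_tokenizer, soundstream,
               semantic_stage, acoustic_stage):
    """One conceptual training step on audio-only data."""
    # Conditioning tokens from MuLan's audio tower.
    mulan_audio_tokens = mulan.audio_tokens(audio)
    # Semantic tokens capture long-term structure (melody, rhythm).
    semantic_tokens = semantic_tokenizer.tokens(audio)
    # Acoustic tokens from the SoundStream codec capture fine acoustic detail.
    acoustic_tokens = soundstream.encode(audio)

    # Stage 1: semantic modelling, conditioned on MuLan audio tokens.
    semantic_stage.train(conditioning=mulan_audio_tokens,
                         targets=semantic_tokens)
    # Stage 2: acoustic modelling, conditioned on MuLan audio tokens
    # plus the semantic tokens.
    acoustic_stage.train(conditioning=mulan_audio_tokens + semantic_tokens,
                         targets=acoustic_tokens)


def generate(text_prompt, mulan, semantic_stage, acoustic_stage, soundstream):
    """Inference: MuLan text tokens replace audio tokens as conditioning."""
    # The shared music-text embedding space is what makes this swap possible.
    mulan_text_tokens = mulan.text_tokens(text_prompt)
    semantic_tokens = semantic_stage.sample(conditioning=mulan_text_tokens)
    acoustic_tokens = acoustic_stage.sample(
        conditioning=mulan_text_tokens + semantic_tokens)
    # The SoundStream decoder maps acoustic tokens back to a 24 kHz waveform.
    return soundstream.decode(acoustic_tokens)
```

The design point the sketch highlights is that training requires no paired text at all: MuLan’s shared embedding space lets text tokens stand in for audio tokens only at inference time.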
Limitations
Some limitations of the method are inherited from MuLan, in that the model misunderstands negations and does not adhere to the precise temporal ordering described in the text.
The Music DALL-E
Much as DALL-E 2 uses CLIP, a joint image-text embedding model, for text conditioning, MusicLM relies on MuLan, a joint music-text embedding model, for the same purpose. But unlike DALL-E 2, which uses a diffusion model as its decoder, MusicLM’s decoder is based on AudioLM.
Two weeks ago, Microsoft released VALL-E, a new language model approach for text-to-speech synthesis (TTS) that uses audio codec codes as intermediate representations. It demonstrated in-context learning capabilities in zero-shot scenarios after being pre-trained on 60,000 hours of English speech data.
However, Google has announced that it will not make MusicLM available to the public due to potential risks. These include biases in the training data that could lead to the underrepresentation of some cultures and to cultural appropriation, technical errors, and the risk of misappropriation of creative content.