Kyutai, a French non-profit AI research laboratory, has introduced Moshi, a real-time, natively multimodal foundation model. This open-source project features a voice-enabled AI assistant with capabilities that rival OpenAI’s GPT-4o and Google’s Project Astra.
Moshi, developed by a team of just eight researchers in six months, can understand and express 70 different emotions and styles, speak with various accents, and handle two audio streams simultaneously, allowing it to listen and talk at the same time.
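The two-stream design means incoming and outgoing audio are processed concurrently rather than turn by turn. Below is a minimal sketch of that kind of full-duplex loop using the sounddevice library; the `respond` function is a hypothetical stand-in for the model, not Moshi’s real API.

```python
# Minimal full-duplex audio loop: microphone input and speaker output
# are handled in the same callback, so "listening" and "talking" happen
# concurrently. Illustrative only; not Moshi's actual inference code.
import numpy as np
import sounddevice as sd

def respond(frame: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder: a real system would feed `frame` to the
    # model and return its next chunk of synthesized speech.
    return np.zeros_like(frame)

def callback(indata, outdata, frames, time, status):
    # Input (what the user is saying) arrives while output (what the
    # assistant is saying) is being written, i.e. two simultaneous streams.
    outdata[:] = respond(indata)

with sd.Stream(samplerate=24_000, channels=1, callback=callback):
    sd.sleep(5_000)  # run the duplex loop for five seconds
```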
Built on the Helium 7B language model, Moshi is trained jointly on text and audio, and its inference stack is optimised for CUDA, Metal, and CPU backends, with support for 4-bit and 8-bit quantization.
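Quantization is what makes the consumer-hardware claim below plausible: storing weights in 8 (or 4) bits instead of 32 roughly quarters (or eighths) the memory footprint. As a minimal sketch, here is what symmetric 8-bit weight quantization involves, written in generic NumPy; it illustrates the technique, not Moshi’s actual code.

```python
# Symmetric per-tensor int8 quantization: a generic illustration.
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights onto int8 with a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small reconstruction error
```

4-bit variants work the same way but pack two values per byte, trading a larger reconstruction error for half the memory again.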
Key features of Moshi include:
- Real-time interaction with end-to-end latency of 200 milliseconds (see the latency-budget sketch after this list)
- Ability to run on consumer-grade hardware, including MacBooks
- Support for multiple backends (CUDA, Metal, CPU)
- Watermarking to detect AI-generated audio (in progress)
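As a back-of-the-envelope check on the 200-millisecond figure, the budget below decomposes end-to-end latency into its usual components; every number is an illustrative assumption, not a published Moshi measurement.

```python
# Illustrative latency budget; all values are assumptions.
codec_frame_ms = 80   # assumed duration of one audio-codec frame
model_step_ms = 40    # assumed model compute per frame
transport_ms = 80     # assumed capture + playback + network overhead
total_ms = codec_frame_ms + model_step_ms + transport_ms
print(total_ms)  # 200 -> consistent with the quoted end-to-end latency
```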
Kyutai chief Patrick Pérez said Moshi has the potential to revolutionize human-machine communication: “Moshi thinks while it talks”.
Kyutai plans to release the full model, including the inference codebase, the 7B model, the audio codec, and the optimised stack.
Founded in November 2023 with €300 million in backing from donors including French billionaire Xavier Niel, Kyutai aims to contribute to open research in AI and foster ecosystem development.
The lab’s approach challenges major AI companies like OpenAI, which have faced criticism for delaying releases due to safety concerns. Notably, OpenAI has been withholding the release of its video generation model Sora, as well as the Voice Engine and voice mode features of GPT-4o.
Moshi contributes to France’s increasing influence in the AI sector, alongside other French-origin projects such as Hugging Face and Mistral.