
Voice Slowly Catching Up on Multimodal AI Features

The rapid growth of lip-sync and voice-integration features that complement AI-generated videos is helping ‘voice’ gain prominence as a modality in multimodal models.


Illustration by Nikhil Kumar

Eleven Labs, a voice technology research company that develops AI for speech synthesis and text-to-speech software, recently added voice to videos generated by Sora, offering a holistic example of what voice can bring to AI-generated videos. While this is not the first development of its kind, the voice modality is increasingly being brought to the forefront.

It’s Not All Easy with Voice 

Voice is considered a uniquely difficult interface for AI because it relies on probabilistic AI, as opposed to the deterministic, machine learning-based voice services such as Apple’s Siri and other home assistant products.

Technology investor Michael Parekh believes that implementing perfect AI voice modality on devices will take a long time. “It’s going to be a long road to get it right, likely as long as it took even the previous versions, like Apple Siri, Google Nest, and Amazon Alexa/Echo especially, to barely tell us the time, set timers, and play some music on demand,” he said.

Voice has also been chosen as a mode of interaction, evident in its implementation as a primary user interface in devices such as the Rabbit R1. The Humane Ai Pin, a small, futuristic wearable AI device that can be pinned to one’s clothing, relies on finger gestures and voice for operation.

SoundHound Inc, an AI voice and speech recognition company founded nearly two decades ago that develops technologies for speech recognition, NLP and more, predicted as far back as 2020: “Although voice does not need to be the only method of interaction (nor should it be), voice assistants will soon become a primary user interface in a world where people will never casually touch shared surfaces again.”

Voice for Video 

The stream of AI voice integration announcements has spiked in the last few weeks. Pika Labs, which creates AI-powered tools for generating and editing videos, came into the limelight a few months ago with $55 million in funding. It recently announced early access to a ‘Lip Sync’ feature for Pro users that adds voice and dialogue to AI-generated videos.

Alibaba’s EMO (Emote Portrait Alive), an AI generator that produces expressive portrait videos using audio2video diffusion models, was released last week as direct competition to Pika Labs. The company released videos in which still images were made to talk and sing with expressive facial gestures.

Voice has also been integrated to simplify podcasts. Eleven Labs partnered with Perplexity to launch ‘Discover Daily’, a daily podcast narrated by Eleven Labs’ AI-generated voices, another example of how combining voice technology with other functionalities can create tangible use cases.

Theme for 2024 

Multimodal AI was among the top three AI trends Microsoft identified for 2024. “Multimodality has the power to create more human-like experiences that can better take advantage of the range of senses we use as humans, such as sight, speech and hearing,” said Jennifer Marsman, principal engineer in AI (Office of the CTO) at Microsoft.

Microsoft’s efforts in the same direction are reflected in its AI offering, Microsoft Copilot. Catering to enterprises and consumers alike, Copilot’s multimodal capabilities can process various formats, including images, natural language and Bing search data. Multimodal AI also powers Microsoft Designer, a graphic design tool for creating designs, logos, banners and more from a simple text prompt.

The latest AI kid on the block, Perplexity, has also integrated multimodal features: Pro users can upload images and get relevant answers based on them. A common theme runs through all these functionalities. Is ‘voice’ truly an added feature?

Big Tech’s Foray Into Voice 

With the release of ChatGPT’s voice feature, which allows users to converse with the model easily, almost six months after the launch of the multimodal GPT-4, voice capability was fully integrated into GPT-4. Google Gemini, Google’s most powerful AI model, is also multimodal.

While the advancements are promising, risks of misuse still persist, with the most prominent being deepfakes. With an increasing number of companies entering the space, adding voice to AI-generated videos only increases the potential for abuse, where stringent copyright and privacy laws may be the only saviour.



Vandana Nair

With a rare blend of engineering, MBA, and journalism degrees, Vandana Nair brings a unique combination of technical know-how, business acumen, and storytelling skills to the table. Her insatiable curiosity for all things startups, businesses, and AI technologies ensures that there’s always a fresh and insightful perspective in her reporting.