As part of several initiatives that Google has taken up in India to improve Indic LLM capabilities, Google Pay vice president and GM Ambarish Kenghe announced the launch of IndicGenBench.
A benchmark for evaluating the generative capabilities of Indic LLMs, IndicGenBench is part of a slew of updates released during Google I/O Bengaluru 2024. Kenghe said that the benchmark covers as many as 29 languages, including several Indian languages that do not currently have benchmarks.
Speaking to AIM, Google Cloud director of customer engineering and field CTO Subram Natarajan said, “In India, there are two main areas of focus: Addressing language-related issues, while the second involves large-scale transformations across various industries, be it in customer engagement or addressing the broader needs of the Indian population.”
With a focus on addressing language-related issues, Kenghe announced the open sourcing of DeepMind’s Composition to Augment Language Models (CALM), allowing developers to combine specialised language models with Google’s Gemma models. Interestingly, the research on CALM was carried out by the Google DeepMind and Google Research teams in India, with the paper released earlier this year.
“Let’s say you’re building a coding assistant that can converse in English. Now, by composing a Kannada specialist model with CALM, you may be able to offer coding assistance to Kannada users as well,” explained Kenghe.
This focus on Indic language LLMs comes as DeepMind expands Project Vaani, a collaborative effort between Google and the Indian Institute of Science (IISc), under which over 14,000 hours of speech data in 58 languages has been made accessible to developers. This data was collected from over 80,000 speakers in 80 districts across the country.
As previously covered by AIM, this data is being open-sourced as part of MeitY’s flagship AI initiative, Bhashini. These capabilities are set to expand soon, as Bhashini has also launched an initiative called Bhasha Daan to help crowdsource voice and text data in multiple Indian languages.