A few weeks ago, Tech Mahindra announced the launch of Project Indus, a foundation model for Indian languages that could potentially prove to be its most important project ever. Large language models (LLMs) such as OpenAI's GPT models, despite their multilingual capabilities, have been trained predominantly on English datasets, which limits their proficiency in comprehending and generating content in Indic languages. An open-source Indic LLM would therefore be hugely beneficial for India.
According to Tech Mahindra’s chief CP Gurnani, the model will be the biggest Indic LLM and could potentially cater to 25% of the world’s population. While Tech Mahindra has not revealed the cost of the project or when the model is expected to launch, the aim is to build a 7-billion-parameter LLM to begin with, Nikhil Malhotra, global head of Makers Lab at Tech Mahindra, told AIM.
The model is expected to initially support 40 different Hindi dialects, with more languages and dialects to be added subsequently. “We understand that much work has been done on the Indic suite like Bhashini and AI4Bharat, etc., but a foundation model still needs to be developed. As we continue to develop the model, we are constantly learning and improving the process. Our interface could have voice and textual information; however, we haven’t considered incorporating a chat interface like ChatGPT yet,” Malhotra said.
Tech Mahindra’s primary goal is to first create an LLM for text continuation and then provide dialogue capability. “Once we are clear that the model performs well and generates dialects well, we would launch it in the open source.”
Benefits of building India’s biggest Indic LLM
ChatGPT, driven by OpenAI’s GPT models, has undoubtedly been groundbreaking. Hence, developing an LLM designed primarily for Indic languages could be highly beneficial for India for a wide array of reasons. Understanding the nuances of local cultures and contexts is essential for effective communication, and an Indic LLM can be designed to prioritise cultural sensitivity, ensuring that the generated content respects local customs and norms. An Indic LLM could also democratise AI and cater to the large section of non-English speakers in the country.
“One of the benefits of a foundation model is its versatility. For instance, a language model is capable of performing multiple tasks such as Q&A, fill-in-the-blanks, etc. using the same model. This approach is beneficial for specialised healthcare, retail, and tourism industries,” Malhotra said.
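This versatility comes from prompting a single pretrained model in different ways rather than training separate systems per task. The sketch below is a minimal illustration using the Hugging Face transformers library with a placeholder model name; it is not Project Indus code, which has not been released.

```python
# Minimal sketch: one foundation model, several tasks via prompting.
# "gpt2" is a stand-in model; a real Indic foundation model would be swapped in.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = {
    "continuation": "The village fair begins when",
    "qa": "Q: What is the capital of India?\nA:",
    "fill_in_the_blank": "Fill in the blank: The sun rises in the ____.",
}

for task, prompt in prompts.items():
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    print(task, "->", out[0]["generated_text"])
```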
Moreover, token costs in the GPT models are significantly higher for Indic languages than for English, because their tokenisers, trained largely on English text, split Indic scripts into many more tokens. Hence, an Indic LLM offers a more cost-effective way of generating content in Indic languages without those token-pricing constraints. “It represents unrepresented languages and hence helps preserve them. Being the forerunner in this space, Tech Mahindra stands to benefit from the model. In fact, techniques from the model can be leveraged to benefit our customers,” he added.
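The tokenisation gap is easy to observe. The sketch below uses OpenAI's open-source tiktoken tokenizer to count tokens for an English sentence and its Hindi equivalent; exact counts depend on the encoding, but Devanagari text typically breaks into several times as many tokens, which translates directly into higher per-token cost.

```python
# Sketch: compare token counts for English vs. Hindi with tiktoken,
# OpenAI's open-source BPE tokenizer. Counts vary by encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4

english = "How is the weather today?"
hindi = "आज मौसम कैसा है?"  # the same question in Hindi

print("English tokens:", len(enc.encode(english)))
print("Hindi tokens:  ", len(enc.encode(hindi)))
# Hindi usually needs several times more tokens for the same content,
# so per-token pricing makes Indic-language usage costlier.
```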
Building Indic datasets
The effectiveness of an AI model hinges on the quality of its datasets. While ample English datasets are readily accessible, there is a scarcity of datasets for Indic languages and dialects. Recognising this challenge, various stakeholders, including the Indian government, are actively engaged in the creation of such datasets.
Last year, Prime Minister Narendra Modi launched the Bhashini project, which aims to develop language translation technologies that can effectively translate content from one Indian language to another. The initiative also aims to crowdsource voice datasets in multiple Indian languages to enhance the availability and accessibility of digital services in local languages.
Moreover, educational institutions such as the Indian Institute of Science (IISc) and IIT Madras (AI4Bharat), and even Microsoft, are involved in building datasets for Indic languages. “Despite various efforts, in India, datasets for languages other than Hindi are scarce and incomplete. Additionally, even Hindi data is fragmented,” Malhotra said. He also confirmed that Tech Mahindra is actively in talks with leading universities and other stakeholders for Project Indus.
Tech Mahindra is sourcing information from various platforms, including Common Crawl, newspapers, Wikipedia and YouTube descriptions. “The information on dialects is primarily available through YouTube videos or spoken language samples. We are also sourcing information commonly available on the internet from books written in specific dialects.” Malhotra also acknowledged that access to the required computation is a key challenge for the tech giant.
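To give a sense of how web-crawled text might be filtered for a target script, the sketch below keeps only lines with a high proportion of Devanagari characters. This is a generic heuristic for corpus building, not a description of Tech Mahindra's actual pipeline.

```python
# Sketch: keep web-crawled lines that are mostly Devanagari script.
# A generic heuristic for building Hindi/dialect corpora; not Project Indus code.

def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    deva = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return deva / len(chars)

def filter_lines(lines, threshold=0.7):
    """Yield lines whose Devanagari share meets the threshold."""
    for line in lines:
        if devanagari_ratio(line) >= threshold:
            yield line

sample = [
    "यह हिंदी का एक वाक्य है।",      # mostly Devanagari -> kept
    "This line is English only.",    # dropped
    "Mixed line with थोड़ी हिंदी",     # kept or dropped depending on threshold
]
print(list(filter_lines(sample)))
```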
Banking on Bhasha Daan
For Tech Mahindra and for the success of Project Indus, the biggest challenge is gathering data for different dialects. For this, the IT giant is seeking contributions from the speakers of these dialects to help build the datasets. “For this reason, we have opened a portal to get a bhasha daan from Indians who speak that dialect.
By clicking “Make a Contribution” on our website, you will find a user-friendly interface with all the dialects in which we collect data. Once you select a dialect, you can listen to a sample voice recording of how Hindi is spoken in that particular dialect. Users can then scroll down and anonymously record a sentence by clicking the record button.”
Gurnani took to X (formerly Twitter) to request contributions from the general public to assist in the creation of datasets for Indic dialects. “We humbly request a bit of bhasha daan from you. Please lend us your expressions, your vocabulary, your conversations and help us train India’s biggest indigenous LLM,” he posted.
Mitigating biases in datasets
Oftentimes, the biases that manifest in AI models originate from biases present in their datasets. Since LLMs learn from vast amounts of text data available on the internet, these biases, when not appropriately addressed, can carry over into the output generated by the models.
While building datasets from scratch, Tech Mahindra must put guardrails in place to ensure this does not happen. “When we collect the data at the first phase, it is essential to realise that this data would have to go through cleaning to ensure there is no bias. To address this challenge, we would be using both human annotation and automatic techniques to ensure there is no racial, ethnic, or gender bias, etc.,” Malhotra said.
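As a rough illustration of the "automatic plus human" approach Malhotra describes, the sketch below flags sentences containing terms from a small placeholder blocklist so they can be routed to human annotators. Real bias screening would combine far richer lexicons, trained classifiers and annotation workflows; this is only a simplified assumption of how the first automatic pass might look.

```python
# Sketch: route potentially biased sentences to human review.
# The blocklist is a tiny placeholder; production systems would combine
# curated lexicons, trained classifiers, and human annotation.

SENSITIVE_TERMS = {"slur_example_1", "slur_example_2"}  # placeholder terms

def needs_human_review(sentence: str) -> bool:
    """Flag a sentence if it contains any term from the sensitive-term list."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return bool(words & SENSITIVE_TERMS)

corpus = [
    "A neutral sentence about the weather.",
    "A sentence containing slur_example_1 that should be reviewed.",
]

auto_clean = [s for s in corpus if not needs_human_review(s)]
review_queue = [s for s in corpus if needs_human_review(s)]
print("kept automatically:", auto_clean)
print("sent to annotators:", review_queue)
```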
While it’s a commendable move, the success of the project hinges on various factors such as robust data collection, efficient model training, and careful handling of linguistic nuances.