While fine-tuning existing English-centric models with Indic-language tokens is a viable approach, building foundational models from scratch offers several advantages, and that is what BharatGPT aims to do.
“Existing models may not adequately represent the Indian cultural and linguistic diversity, which can lead to biases and limitations in their applicability,” said Professor Maunendra Sankar Desarkar from IIT Hyderabad, who is also a core team member of the BharatGPT initiative.
“Moreover, fine-tuning may not fully address the unique linguistic challenges posed by Indic languages,” he added. By building foundational models tailored to the Indian context, he said, we can ensure greater inclusivity and effectiveness across diverse linguistic communities, delivering AI in India in the best possible way.
“We’re sourcing data from various repositories available on the web, including digitised books and datasets,” Desarkar added. He said that OCR technology plays a crucial role in digitising textual content, though it is not always error-free. “We’re exploring methods to detect and correct OCR errors algorithmically,” he added, highlighting the collaboration with organisations like BHASHINI.
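The article does not spell out these correction algorithms. One common baseline is dictionary-based correction, where out-of-vocabulary OCR tokens are matched against a known lexicon. The sketch below uses Python’s standard difflib; the vocabulary and the misread word are purely hypothetical.

```python
import difflib

# Hypothetical vocabulary of known-good words (in practice, a large
# Indic-language lexicon built from clean corpora).
VOCAB = {"नमस्ते", "भारत", "भाषा", "पुस्तक", "विज्ञान"}

def correct_ocr_token(token: str, vocab: set[str], cutoff: float = 0.8) -> str:
    """Return the token if it is in the vocabulary, otherwise the closest
    vocabulary entry above the similarity cutoff."""
    if token in vocab:
        return token
    # difflib ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(token, list(vocab), n=1, cutoff=cutoff)
    return matches[0] if matches else token  # leave unknown tokens untouched

# Example: an OCR pass that dropped one character.
print(correct_ocr_token("भरत", VOCAB))  # likely corrected to "भारत"
```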
Indic data is a gold mine for global research
“We observed that many communities, including the Indian diaspora, produce content that differs from the polished English typically encountered,” explained Desarkar. This realisation prompted the researchers to explore how NLP techniques could be tailored to better serve diverse linguistic communities. “Consequently, our focus expanded to include domains such as healthcare management and travel planning, where NLP could offer valuable solutions.”
“Beyond the sheer volume of data required, there is also significant heterogeneity in data distribution, particularly in multilingual settings within India,” Desarkar added. This necessitates different algorithmic treatments and techniques to effectively handle and process diverse datasets, ensuring that models are robust and adaptable across different linguistic contexts, not just one.
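Desarkar does not name the specific techniques, but one standard recipe for handling skewed multilingual data is temperature-based sampling, which flattens the language distribution so low-resource languages are seen more often during training. A minimal sketch, with made-up corpus sizes:

```python
# Temperature-based sampling over languages: p_i ∝ n_i ** (1/T).
# T = 1 reproduces the raw (skewed) distribution; larger T flattens it,
# so low-resource languages appear more often in training batches.
# The corpus sizes below are illustrative, not real figures.

corpus_sizes = {"hindi": 50_000_000, "bengali": 8_000_000, "bhojpuri": 200_000}

def sampling_probs(sizes: dict[str, int], temperature: float) -> dict[str, float]:
    weights = {lang: n ** (1.0 / temperature) for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

print(sampling_probs(corpus_sizes, temperature=1.0))  # heavily skewed to Hindi
print(sampling_probs(corpus_sizes, temperature=5.0))  # much flatter mix
```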
Moreover, identifying similarities between languages can inform the development of intelligent techniques to improve model performance, particularly for languages with limited training data. By leveraging insights from languages like Hindi, which may share similarities with other languages, researchers can develop strategies to enhance performance in related linguistic domains that lack sufficient data, such as Bhojpuri, Desarkar explained.
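As an illustration of what measuring such similarity can look like, the sketch below computes character n-gram overlap between two tiny, hypothetical word lists; real studies would use large corpora and more careful normalisation.

```python
def char_ngrams(words: list[str], n: int = 3) -> set[str]:
    """Collect all character n-grams appearing in a word list."""
    grams = set()
    for w in words:
        grams.update(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

def jaccard_similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

# Tiny illustrative word lists; not real linguistic data.
hindi_words    = ["पानी", "खाना", "जाना", "आदमी"]
bhojpuri_words = ["पानी", "खाना", "जाए", "अदमी"]

sim = jaccard_similarity(char_ngrams(hindi_words), char_ngrams(bhojpuri_words))
print(f"char 3-gram Jaccard similarity: {sim:.2f}")
```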
“One of the things that we are very focused on is developing models that reach a vast audience, which includes a small hospital in a village without the available resources to run OpenAI’s model, for example,” said Desarkar. In this regard, he said that the BharatGPT initiative is also focusing on the development of smaller, more efficient models that deliver comparable performance without requiring extensive infrastructure.
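The article does not say how BharatGPT plans to shrink its models. One widely used route to smaller models with comparable performance is knowledge distillation, where a compact student is trained to mimic a larger teacher. Below is a minimal sketch of the standard distillation loss in PyTorch, with hypothetical logits; it is not BharatGPT’s stated method.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation loss (Hinton et al., 2015).

    The student matches the teacher's temperature-softened output
    distribution; the T^2 factor keeps gradient magnitudes comparable
    across temperatures.
    """
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (t * t)

# Hypothetical logits: a batch of 4 examples over a 10-class output.
teacher = torch.randn(4, 10)
student = torch.randn(4, 10)
print(distillation_loss(student, teacher).item())
```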
Desarkar said that, for all these reasons, models trained on Indic data would benefit the whole world, thanks to the hidden knowledge and depth such models would achieve.
Tailored solutions for India
A month ago, Desarkar’s paper titled ‘CharSpan: Utilising Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages’ was accepted at the EACL 2024 main conference; it deals with low-resource translation models. Another paper, titled ‘Unsupervised Noise Injection to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages’, was selected for EMNLP 2023.
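The core idea named in the CharSpan title is exploiting lexical similarity for zero-shot translation; one ingredient in such setups is injecting character-span noise into high-resource training data so the model tolerates the spelling variation of a related low-resource language. The sketch below shows a simple deletion-only variant of character-span noising; the parameters and scheme are illustrative, not the paper’s exact method.

```python
import random

def charspan_noise(text: str, noise_rate: float = 0.1, max_span: int = 3,
                   seed: int = 0) -> str:
    """Drop random character spans from the input.

    An illustrative stand-in for character-span noising: training a
    high-resource-language MT model on noised input like this can make it
    more robust to spelling variation in a closely related low-resource
    language. Deletion-only and these parameters are simplifications.
    """
    rng = random.Random(seed)
    out, i = [], 0
    while i < len(text):
        if rng.random() < noise_rate:
            i += rng.randint(1, max_span)  # skip (delete) a short span
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(charspan_noise("यह एक उदाहरण वाक्य है"))
```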
“While our long-term plans involve diverse sources, our immediate focus is on healthcare data obtained through partnerships with relevant organisations,” added Desarkar. “This data isn’t randomly sourced from the web but is generated through specific channels, ensuring its relevance and reliability,” he added, saying that it helps in giving unique and essential solutions for specific needs.
Desarkar said that while Indic models are on the rise, there is also a need for building a good metric and benchmark for these models. “We’re actively working on developing metrics tailored to Indic languages, which will enable more accurate evaluation of model performance,” he added, saying that fostering dialogue among researchers to establish a consensus on evaluation metrics is crucial for advancing the field.
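The metrics under development are not described in the article, but character-level scores such as chrF are often preferred to word-level BLEU for morphologically rich Indic languages. Below is a simplified, single-order chrF-style F-score, for illustration only; the real chrF averages over n-gram orders 1 to 6.

```python
from collections import Counter

def char_ngram_counts(text: str, n: int = 3) -> Counter:
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis: str, reference: str, n: int = 3,
              beta: float = 2.0) -> float:
    """Simplified chrF-style F-score on a single character n-gram order.

    beta = 2 weights recall twice as heavily as precision, as in chrF.
    """
    hyp, ref = char_ngram_counts(hypothesis, n), char_ngram_counts(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if not hyp or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(chrf_like("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))
```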
Desarkar has worked in this field for more than a decade. After completing his PhD at IIT Kharagpur, he focused on problems such as ranking, search, and recommendation systems. Over time, as the amount of data available online grew exponentially, the relevance of NLP in particular became more apparent.
Even in areas like e-commerce and social media, where communication was initially concise, “we began to see longer-form content”, he said, explaining what drew him to the field.
Talking about BharatGPT, Desarkar said that the computational demands are substantial. “While we’ve secured commitments from certain quarters, we may need to leverage cloud services to address this challenge,” he added. However, beyond hardware infrastructure, data availability is also critical.
“While we may have sufficient data for these languages, expanding beyond that presents challenges,” he added. Consequently, the BharatGPT team is exploring algorithmic solutions to make the most of limited data resources. “While progress is being made in setting up the necessary infrastructure, we’re also focusing on addressing algorithmic and data-related challenges, and this would be beneficial for the whole world,” concluded Desarkar.