BharatGPT Aims to Become Meta of Indic LLMs

BharatGPT believes in the power of open source, as India is one of its largest contributors.

Published on January 23, 2024

by Mohit Pandey

When it comes to India and generative AI, all of the recent Indic language models have been built on top of models from Meta or Mistral. The Indian research community, though still emerging in this field, wants to build models from scratch, and that is where BharatGPT is finding its moat.

Speaking to AIM, Ganesh Ramakrishnan, a professor at IIT Bombay who is leading the BharatGPT initiative, said that there is definitely a need for a foundational model for Indic languages. “We are building foundational models from scratch and that is what is keeping us busy,” said Ramakrishnan, describing how BharatGPT will put India on the global AI map.

Every company and every country is racing to be the best at generative AI, and more specifically at LLMs. In the beginning, it was US companies such as Google, Microsoft, and OpenAI that built the best AI models. Now Meta, with Llama 2, and France's Mistral, with its open source models, can be touted as leading the way. The same goes for China, where Baidu and Alibaba are leading research. It is only a matter of time before India makes its mark.

Open source is the winner

Indian academic institutions know the power of open source, especially when it comes to specialised models focused on healthcare or other specific fields. For example, researchers from IIT Patna recently unveiled a massive multilingual healthcare dataset called MedSumm.

“The hope is that once we put something out, the bandwagon then begins. Folks put more stuff out and like I said, the dataset piece is very important because once you make that available, people start to utilise it in various different ways,” Aman Chadha, a researcher on the paper, told AIM.

BharatGPT is building on a similar philosophy. Even though BharatGPT is currently focused on building foundational models for India, the initiative has also been open-sourcing much of its work on Decile (similar to GitHub and Hugging Face) under licences that permit commercial use, and it aims to continue doing so.

Ramakrishnan also emphasised that solutions need to be developed across different verticals, such as banking, healthcare, and farming. “Mistral got France on the AI map. We want India to get on the AI map with BharatGPT.”

He said that while India can keep writing research papers and producing graduates, it is important that all of this helps bolster the Indian ecosystem. “I think it’s time to not just create some AI solution for India, but a working full stack,” Ramakrishnan added.

“We want everyone to use generative AI,” Vishnu Vardhan, the founder of Vizzhy, who is also the GPU buddy of BharatGPT, told AIM, adding that the initiative would not just be about releasing weights, but about making the model available to everyone. He highlighted that the first model would be open source and that the team wants developers to help make it better. “The more people use it, the better it will become,” he said, adding that the team would eventually release more versions of the model.

Vardhan emphasised that English models such as Llama and Mistral are good, but cannot address the complexity of Indian languages. He said that building Indic LLMs with a million tokens on top of English foundational models, which are trained on trillions of tokens, won’t be enough to bring them on par with GPT-4 for languages such as Kannada, Tamil, or Hindi.

He highlighted that Indic languages are closely related to each other in their grammatical structure, which is very different from that of English. “That is why we decided that it would be easier to build models if they are originally in any Indic languages such as Hindi,” he explained, as it would be easier to adapt them to other languages later.

The BharatGPT team is also working on video and speech models, which would be released soon after the launch of the initial text-based foundational model.

The need for open source Indic models

Indian developers are among the biggest contributors to open source. According to GitHub, India now has 13.2 million developers on the platform, and the nation has firmly secured its position as the second-largest contributor to AI projects worldwide, just behind the United States. Of that total, 3.5 million developers joined GitHub in 2023.

All of this comes while Meta, the open source giant, is aiming for AGI with its upcoming Llama 3 model and plans to have a total of about 350,000 H100 GPUs by the end of 2024. The BharatGPT team is also looking to acquire many more GPUs to continue training its models, and to create an AI lab at IIT Bombay where developers can train their own models.

“Nobody else will build these Indic language models, and it’s definitely needed at the moment,” Amit Sheth, a professor at Arizona State University who recently met Prime Minister Narendra Modi for discussions on AI policy, told AIM, expressing enthusiasm about the rise of Indic LLMs such as the Kannada, Tamil, and Telugu models based on Meta’s Llama 2. But he also highlighted that India needs to build models from scratch.

Sheth also acknowledged that this is a very expensive and compute-heavy task. He believes that India is not yet ready to train a model that competes with OpenAI’s GPT-4, as it would take millions of dollars and a large amount of compute, which is hard to acquire at the moment. He added that data must also be gathered ethically.

He believes that this can be achieved through partnerships between academia and the private sector, a path that BharatGPT, Ola’s Krutrim, Sarvam AI, and Tech Mahindra’s Project Indus are already on. “A lot more research needs to be done, and I think high-quality original research is growing,” he added.

It only makes sense that BharatGPT becomes the open source foundational model for Indic language models, and as the researchers say, it is clearly headed that way.