Cropin’s Aksara is a prime example of building a solution on top of open-source models. It is a micro language model built on Mistral-7B-v0.1 that aims to democratise agricultural knowledge and empower farmers.
Other models, such as OpenHathi and Tamil LLaMA, are likewise built on open-source foundations and aim to break the language barrier.
Sure, there are initiatives and companies building LLMs from scratch in India, such as Tech Mahindra’s Project Indus, Sarvam AI, and Krutrim AI, but these have yet to be released to the general public. For now, open-source LLMs are the only way forward.
As Nandan Nilekani rightly pointed out, India’s focus should be on using AI to make a difference in people’s lives. “We are not in the arms race to build the next LLM, let people with capital, let people who want to pedal ships do all that stuff… We are here to make a difference, and our aim is to put this technology in the hands of people,” Nilekani said.
Multiple Languages? Open Source is Here to Help
Apart from cost and other resources, India’s 22 official languages and hundreds of dialects pose a major challenge to building an AI model for the country. Here’s where the core features of open source come into play.
To solve this, India can use a Mixture of Experts (MoE) approach to blend available language-specific models like Tamil LLaMA and Kannada LLaMA into a single multilingual model that runs on minimal resources, solving the language barrier problem. A simplified sketch of the idea follows.
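To make the idea concrete, here is a minimal sketch of per-request routing between language-specific experts based on the input script. The checkpoint names are assumptions for illustration, and a true MoE would learn the routing and share parameters across experts rather than swapping whole models:

```python
# A minimal sketch of MoE-style routing between language-specific models.
# Checkpoint names below are assumptions; a production MoE would merge
# experts at the layer level rather than picking whole models.
from transformers import AutoModelForCausalLM, AutoTokenizer

EXPERTS = {
    "tamil": "abhinand/tamil-llama-7b-instruct-v0.1",  # assumed checkpoint name
    "kannada": "Tensoic/Kan-LLaMA-7B-SFT-v0.1",        # assumed checkpoint name
}

def detect_language(text: str) -> str:
    """Route by Unicode script: Tamil block U+0B80-U+0BFF, Kannada U+0C80-U+0CFF."""
    for ch in text:
        if "\u0b80" <= ch <= "\u0bff":
            return "tamil"
        if "\u0c80" <= ch <= "\u0cff":
            return "kannada"
    return "tamil"  # arbitrary fallback for this two-expert demo

def generate(prompt: str) -> str:
    name = EXPERTS[detect_language(prompt)]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```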
Also, it is far easier to train a model when one already exists for a neighbouring language. For example, if you want a model for Awadhi and a Hindi LLM is available, adapting it for Awadhi is much easier than building one from scratch, as the sketch below shows.
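In practice, this means continued pre-training: load the Hindi checkpoint and keep training it on an Awadhi corpus, so the model starts from learned weights rather than random ones. The model and dataset names here are hypothetical placeholders:

```python
# Continued pre-training: adapt a Hindi model to Awadhi instead of training
# from scratch. The model and dataset names are hypothetical placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "some-org/hindi-llm-7b"  # hypothetical Hindi checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

corpus = load_dataset("text", data_files={"train": "awadhi_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="awadhi-llm", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # Hindi weights give a far better starting point than random init
```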
Open-source models like BLOOM and IndicBERT, which are already pre-trained on multiple Indian languages, are a perfect example of how the development of multilingual LLMs can be jumpstarted.
Initiatives like Core ML from Wadhwani AI are reportedly working on creating reusable libraries and open-sourcing their data and code so that their efforts can be reused for further development.
Another good example is how Google’s Flan-T5-XXL was used for legal text analysis focused on the Indian Constitution, yet another direct indication of how open source is helping Indian citizens.
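To give a flavour of what such analysis looks like, here is a minimal sketch using the Hugging Face pipeline API. The smaller flan-t5-base stands in so the example runs on modest hardware, and the question and context are illustrative:

```python
# An instruction-tuned Flan-T5 model answering a question about the Indian
# Constitution. flan-t5-base is used so the sketch runs on modest hardware;
# the work referenced above reportedly used google/flan-t5-xxl.
from transformers import pipeline

qa = pipeline("text2text-generation", model="google/flan-t5-base")

context = ("Article 21 of the Constitution of India states: No person shall be "
           "deprived of his life or personal liberty except according to "
           "procedure established by law.")
question = "Which article of the Indian Constitution protects personal liberty?"

answer = qa(f"Answer the question using the context.\nContext: {context}\nQuestion: {question}")
print(answer[0]["generated_text"])
```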
Costs Are Reduced Drastically
Training a large model like GPT-3 from scratch is estimated to cost anywhere from $4 million to $10 million or more, while some open-source models on par with or better than GPT-3 are available for free. For a developing country like India, it makes far more sense to use such open-source models than to spend millions (or billions) on training in 22 languages.
Research shows that data scientists spend almost 50% of their time cleaning data. The problem gets even worse with multiple Indian languages and dialects, each with its own quirks of sarcasm, ambiguity, and irony.
Opting for a pre-trained open-source model saves much of that time and lets you focus on building something helpful around it. You also gain the advantage of transfer learning: the knowledge a pre-trained model captures from large datasets can be transferred to new tasks. This helps many new Indian AI startups that lack the resources to train models from scratch, as the sketch below illustrates.
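One common low-cost way to exploit that transfer is parameter-efficient fine-tuning, for example LoRA via the peft library. The sketch below freezes the pre-trained weights and trains only small adapter matrices; the base model name is a hypothetical placeholder:

```python
# Transfer learning on a budget: LoRA trains a tiny set of adapter weights
# on top of a frozen pre-trained model. The base model name is a placeholder.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("some-org/indic-llm-7b")  # hypothetical

lora = LoraConfig(
    r=8, lora_alpha=16,                   # low-rank adapter size
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of the weights are trainable
```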
For India, building LLMs from scratch while using open-source LLMs in parallel makes sense: the open-source route lets you leverage AI to solve problems today, while from-scratch efforts can brighten the Indian AI ecosystem in the long run.
An open-source model comes pre-trained, with the flexibility to train it further on a specific language or dialect. Furthermore, users worldwide can contribute datasets that never made it to your list, making your project far more robust than a closed-source model.