Seeing a pressing need to build AI in India, by India, and for India, GreyOrange AI research scientist Guneet Singh Kohli began working on Hugging Face’s Data is Better Together initiative in partnership with Hugging Face’s Daniel van Strien. As the first step, he introduced Sanskriti Bench.
The aim of Sanskriti Bench is to develop an Indian cultural benchmark for evaluating the growing number of Indic AI models. By crafting the benchmark with the help of native speakers from different regions across India, the initiative aims to account for the country’s cultural diversity.
The initiative is also being built with the help of Silo AI’s Dr Shantipriya Parida, who created Odia Llama; Anindyadeep Sannigrahi from Prem AI; and Dr Kalyanamalini S, the language expert from Odia Generative AI.
Talking about the project with AIM, Kohli said that the most important and unique part of the project is that all of the data is novel: it is created by Indians from across the country to ensure diversity, accuracy, and quality. He said that this sets it apart from other datasets in Indic languages, which are essentially translations from English.
Apart from this, Kohli recently also partnered with GitHub and Save the Children to build AI tools for child safety and is preparing an AI system that can catch people who attempt to groom children online. “I write research papers, but eventually, there is no use if you can’t implement it for the people,” said Kohli.
Kohli’s eventual goal is to set up a global AI for Child Safety Lab, ahead of which he hopes to collaborate with several psychologists. “In India, children are using a lot of social media, and it becomes important for the country to also start talking about these,” he said, stressing that such conversations are lacking in India compared to the US and Europe.
The First Phase is Going Strong
According to the roadmap that Kohli laid out, the project is in its first phase. Currently, he is creating questions to build a dataset for benchmarking LLMs, which will then be hosted on the Hugging Face leaderboard.
To create the right questions, Kohli has taken the help of friends from different parts of the country. He gave the example of a friend from Bihar who provided questions, along with the answers, in his native language, Maithili. However, he highlighted that LLMs struggled badly to understand context.
“We asked a question to an AI model about a festival from Bihar for which it was able to answer correctly with all the historical accuracy and the reasons for celebrating it. But when asked about more context on the festival, the model related the whole festival to Odisha,” said Kohli.
He explained that people familiar with the culture can correct such errors using their own knowledge, but researchers who rely on LLMs to study Indian culture cannot. “They would get it completely wrong,” said Kohli. Similarly, he highlighted how different states contribute to the country, such as Punjab, known for agriculture, and Gujarat, known for driving the economy. All of this needs to be represented in AI models with proper attribution.
To ensure that LLMs have geographical, cultural, historical, proverbial, and demographic knowledge about each part of the country in its native language, Kohli has started preparing the dataset.
He is working with several volunteers from Kashmir, Punjab, Kerala, and Assam to integrate the knowledge of each region into the dataset. “I am pushing for the idea that it needs to be completely human-driven,” he added, saying that he does not want to use synthetic data for creating questions as the foundation.
Currently, he is aiming for 500 questions per language and per region of the country, starting with 10 languages. In later versions, this human-written set can be augmented using language models. “The beautiful part of India is each region has a unique language, which would make it diverse in itself,” said Kohli.
He is also working on incorporating figurative language, such as that used in proverbs, poems, similes, and other expressions unique to each part of the country.
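To make the structure described above concrete, here is a minimal sketch in Python of what a single benchmark entry and a progress check against the 500-question target might look like. It is an illustration under stated assumptions, not the project's actual format: the class name, field names, and helper function are all hypothetical, and only the five knowledge areas and the 500-question target come from the article.

```python
from collections import Counter
from dataclasses import dataclass

# Knowledge areas named in the article; the identifiers are illustrative.
CATEGORIES = {"geographical", "cultural", "historical", "proverbial", "demographic"}

# Target stated in the article: 500 questions per language and per region.
TARGET_PER_PAIR = 500


@dataclass(frozen=True)
class BenchmarkQuestion:
    """One human-written question (hypothetical schema, not the real one)."""
    language: str      # e.g. "Maithili"
    region: str        # e.g. "Bihar"
    category: str      # one of CATEGORIES
    question: str
    answer: str
    contributor: str   # identifier for the native speaker who wrote it

    def __post_init__(self):
        # Reject entries that fall outside the named knowledge areas.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def remaining_by_pair(questions, target=TARGET_PER_PAIR):
    """Return how many questions each (language, region) pair still needs."""
    counts = Counter((q.language, q.region) for q in questions)
    return {pair: max(target - n, 0) for pair, n in counts.items()}


# Placeholder text only; no real benchmark content is reproduced here.
sample = [
    BenchmarkQuestion("Maithili", "Bihar", "cultural",
                      "<question text>", "<answer text>", "volunteer-1"),
]
print(remaining_by_pair(sample))  # {('Maithili', 'Bihar'): 499}
```

Keeping the contributor on every entry is one way to support the human-driven approach Kohli describes, since each question can be traced back to a native speaker rather than to synthetic generation.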
Indian Researchers Need to Get Together
Kohli, who holds a BTech from Thapar Institute of Technology and has worked with many non-profit organisations, has always been motivated to make technology work for the people. That is why he started working on the idea in 2020 with Cord.ai.
“I don’t want to call it my project. I eventually want to call it ‘by the people for the people’,” said Kohli, emphasising the fact that initiatives like these would be able to set up a benchmark created by Indians for anyone building AI models in the country, even if they are coming from outside the country.
“If anyone is creating an Indian model, it should be able to handle Indian culture,” said Kohli, highlighting that all the models coming up in the Indic AI space claiming to be the best need to be evaluated.
Talking about OpenAI coming to India and India-based AI offerings such as Krutrim, Kohli said that it is important for researchers who are building AI models in different languages to come together. He said that as the phases are completed, the initiative will become part of the community, with everyone contributing to it.
Also, speaking about the recent launch of the Cohere Aya model in multiple languages, including Hindi, for which Kohli was one of the reviewers of the paper, he said, “If people from outside India can also do it for Indian languages, we being in India, why can’t we do it?”