Open source small language models are increasingly competing with their larger counterparts. Models in the 7-13 billion parameter range, such as Meta’s LLaMA and Mistral 7B, along with Microsoft’s Phi-2, are giving tough competition to OpenAI’s GPT-4 and Google’s Gemini.
Recently, Maxime Labonne, a researcher and open source enthusiast, released Phixtral, which applies the Mixture of Experts architecture behind Mistral’s Mixtral to Microsoft’s Phi-2.
AIM got in touch with Labonne, the creator of Phixtral and NeuralBeagle, to understand his views on the AI landscape, the importance of open source, and whether smaller language models can be on par with larger, closed ones.
“In the long run, open source is [going to be] too powerful and there are way too many people who want to experiment with it right now,” Labonne said. He believes that eventually open source models will overtake the closed ones. “It is like there are monkeys with a typewriter – one of them will create a masterpiece at some point,” he quipped about open source developers, who are not bound by companies and academic institutions.
Sometimes, the dumbest ideas work
Apart from being an open source contributor, Labonne works at JPMorgan as a machine learning scientist and is an avid gamer. Calling himself ‘GPU poor’, he currently splits his computer time between playing Vampire Survivors and experimenting with open source language models.
Running experiments on his laptop, Labonne said he is a huge fan of 7 billion parameter models. “I think it’s a sweet spot with a balance between the requirement of compute and the knowledge,” he said. Larger models are obviously going to be better at reasoning, he added, but what he is betting on is compressing them into smaller models through distillation and quantisation.
“Currently, all the benchmarks that we have are bad,” laughed Labonne, highlighting that both approaches to benchmarking, automated suites such as MMLU and HumanEval on the one hand and manual, human evaluation on the other, are problematic, since scores can easily be inflated by fine-tuning on the test data.
Even so, he said, it is the best approach we currently have. “If you go to the leaderboard that I have created and pick the top model, it actually performs better than the rest,” Labonne said. He has built LLM AutoEval, a Google Colab notebook where developers can simply specify the name of a model, select a GPU, and run the tests.
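Under the hood, LLM AutoEval wraps existing automated benchmark runners. As a rough illustration of the idea (not Labonne’s actual notebook, and with a placeholder model name), a similar evaluation can be scripted with EleutherAI’s lm-evaluation-harness Python API:

```python
# Illustrative sketch only: runs a couple of standard benchmarks against a
# Hugging Face model using EleutherAI's lm-evaluation-harness (pip install lm-eval).
# This is not Labonne's LLM AutoEval notebook; the model below is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=mistralai/Mistral-7B-Instruct-v0.2",
    tasks=["arc_challenge", "hellaswag"],         # example benchmark tasks
    num_fewshot=0,
    batch_size=8,
    device="cuda:0",
)

# Print the aggregated scores for each task
for task, metrics in results["results"].items():
    print(task, metrics)
```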
Apart from this, Labonne has also created an LLM course with roadmaps, divided into three parts – LLM Fundamentals for beginners, LLM Scientist for building LLMs, and LLM Engineer for building and deploying LLM applications. “A lot of people from different backgrounds have been asking me how to get into LLMs, so I created this course for people from every background,” he added.
Currently, Labonne is working on creating his own benchmark, as well as a passion project called ‘Chess LLM’, where developers can create their own test datasets, train very small language models, and make them compete in an arena. “They are very bad at playing chess right now, and it is very funny to see the leaderboard.”
The best open source model and AGI
There is a group of people in the AI realm – some of whom identify as AI ethicists – who are concerned that generative AI content will take over the internet and eventually cause models to collapse as they are trained on synthetic data generated by other AI models. “I think it’s just the opposite,” Labonne replied, saying that synthetic data is actually proving to be a lot more effective than initially thought.
“Now we have solid evidence to prove that synthetic data is of really high quality and it can’t ruin the internet,” he said, citing models such as Orca, which is trained on GPT-4 data and actually performs a lot better than many others.
Labonne has been experimenting with many open source models, such as Mistral, LLaMA, Phi-2, Falcon, and Chinese models such as DeepSeek. According to him, Falcon was great when it came out and topped the charts, but models such as Mistral and Llama 2 now outperform it.
“Phi-2 is great, but I think Mistral’s models are the best right now,” he said, adding that Chinese models are not suitable for English use cases at all.
Labonne said that OpenAI’s strategy of staying closed source since GPT-3 is good for the company, but Meta’s open source approach is what makes him really happy. “I am not very interested in the AGI conversation, I don’t even see how it’s possible right now,” he laughed, wishing the companies striving for it the best of luck.
“We will talk about AGI after 10 years,” he concluded, adding that LLMs are among the most powerful tools we have, but something stronger will be needed to achieve AGI. It is also not cost effective to keep training ever bigger models such as GPT-5, he said, as we have almost reached a saturation point.