Mark Zuckerberg seems to be on the right track as Meta prepares to unveil Llama 3.1, the next iteration of its Llama family, expected to be released today. The new version will come in three sizes (8B, 70B, and 405B) with a context length of 128K tokens.
However, even before Meta could officially release the model, its benchmark card was leaked and is now doing the rounds on social media.
Leaks look credible, Llama 3 405B is GPT-4o class ⭐️

It even outperforms GPT-4o (72.55%) and Claude 3.5 Sonnet (72.83%) in terms of MMLU PRO. The new 70B also looks insane with a significant boost of performance compared to the previous version.

Note that these evals are not…

— Maxime Labonne @ ICML (@maximelabonne) July 23, 2024
According to the leaked information, Llama 3.1 has been trained on over 15 trillion tokens sourced from publicly available datasets. The fine-tuning data comprises publicly available instruction-tuning datasets, along with an additional 15 million synthetic samples.
The models are explicitly advertised as multilingual, offering support for French, German, Hindi, Italian, Portuguese, Spanish, and Thai.
According to benchmarks, Llama 3.1 outperforms OpenAI’s GPT-4o in categories such as general knowledge, reasoning, reading comprehension, code generation, and multilingual capabilities. “Open-source is about to be SOTA — even the 70B is > gpt-4o, and this is before instruct tuning, which should make it even better,” posted a user on X.
Breaking: Rumors on the street (SF) say that Llama-3.1 is going to be released today.

If this battle card comparing Llama-3.1 405/70/8b against GPT-4 is real, we now have SOTA Frontier Models available as open source.

Let me repeat that: models at the level of GPT-4o/GPT-4 are…

— Itamar Golan 🤓 (@ItakGol) July 23, 2024
Llama 3.1 405B achieves a macro average accuracy of 85.2% on the MMLU benchmark, whereas GPT-4o scores 87.5%. GPT-4o still holds a narrow lead here, but Llama 3.1 is highly competitive.
“The 70b is really encroaching on the 405b’s territory. I can’t imagine it being worthwhile to host the 405B. This feels like a confirmation that the only utility of big models right now is to distil from it,” posted another user.
Llama 3.1 405B is expected to be highly effective in generating datasets for smaller models. One user on Reddit pointed out that this could be a major advancement for “distillation”, likening it to the relationship between GPT-4 and GPT-4o.
They suggested using Llama 3.1 70B for “fast inference” and Llama 3.1 405B for dataset creation and critical flows. “Who will use Llama-3.1-405B to create the best training datasets for smaller models?” asked Jiquan Ngiam, founder of Lutra AI.
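If the leak holds up, that workflow could look something like the minimal sketch below: the 405B model, served behind any OpenAI-compatible inference endpoint, generates synthetic instruction-response pairs that are saved as training data for a smaller model. The endpoint URL, the model identifier, and the seed prompts here are placeholders for illustration, not confirmed details from Meta.

```python
# Hypothetical sketch: using a hosted Llama 3.1 405B endpoint to generate
# synthetic instruction-response pairs for distilling a smaller model.
# The base_url and model name below are placeholders, not official values.
import json
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # any OpenAI-compatible server hosting the 405B model
    api_key="not-needed-for-local",
)

seed_prompts = [
    "Explain the difference between supervised fine-tuning and distillation.",
    "Write a Python function that merges two sorted lists.",
]

records = []
for prompt in seed_prompts:
    # Ask the large "teacher" model to answer each seed prompt.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct",  # placeholder identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    records.append({
        "instruction": prompt,
        "output": response.choices[0].message.content,
    })

# Save the pairs as JSONL, ready for fine-tuning a smaller "student" model (e.g. 8B).
with open("synthetic_sft_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

In practice, the generated pairs would typically be filtered and deduplicated before being used to fine-tune the 8B or 70B variants.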
“Honestly might be more excited for 3.1 70b and 8b. Those look absolutely cracked, must be distillations of 405b,” posted another user on Reddit, who goes by the name thatrunningguy.
OpenAI co-founder Andrej Karpathy also explained that in the future, as larger models help refine and optimise the training process, smaller models will emerge. “The models have to first get larger before they can get smaller because we need their (automated) help to refactor and mould the training data into ideal, synthetic formats.”
Last week, we saw the release of several small models that can be run locally without relying on the cloud. Small language models, or SLMs, are expected to coexist with generalised models like GPT-4 and Claude 3.5 Sonnet going forward.
“For everyday use, an 8B or even a 70B LLM will suffice. If you don’t need to push a model to its limits, a SOTA model isn’t necessary for routine questions.”
OpenAI has just caught its breath
OpenAI’s recent compact and cost-effective model, GPT-4o mini, has excelled on benchmarks, achieving 82% on MMLU, 87% on MGSM for maths reasoning, and 87.2% on HumanEval for coding tasks. However, Meta’s Llama 3.1 70B Instruct is closely competitive, matching these impressive scores.
“GPT-4o mini, launched just 4 days ago, is already processing over 200 billion tokens per day! I’m very happy to hear how much people are enjoying the new model,” posted OpenAI chief Sam Altman on X.
OpenAI’s ongoing concern has been the computational resources required, which has delayed the development of its next frontier model. Notably, GPT-4o’s voice capabilities have not yet been made available, and Sora remains unreleased for general use.
Meanwhile, OpenAI has been holding talks with chip designers, including Broadcom, about developing its own chip to reduce its dependence on NVIDIA. Notably, NVIDIA CEO Jensen Huang personally hand-delivered the first DGX H200 to OpenAI.
OpenAI has recently begun training its next frontier model, most likely GPT-5, and the company anticipates that the resulting systems will bring it to the next level of capabilities on the path to AGI.
At Microsoft Build, CTO Kevin Scott said that if the system that trained GPT-3 was a shark and the one that trained GPT-4 an orca, the system training the next model is the size of a whale. “This whale-sized supercomputer is hard at work right now,” he added.
“We’re bringing in the latest H200s to Azure later this year and will be among the first cloud providers to offer NVIDIA’s Blackwell GPUs in B100 as well as GB200 configurations,” said Microsoft chief Satya Nadella.
On the other hand, earlier this year, Zuckerberg announced that Meta is building massive compute infrastructure to support its future roadmap, including 350,000 H100s by the end of this year and nearly 600,000 H100 equivalents of compute overall.
With Llama 3.1, Meta has made it clear that its focus spans the entire LLM market, regardless of model size. Rumours suggest that Meta has already begun training Llama 4, which is expected to be multimodal with audio features and integrated into the Ray-Ban Meta smart glasses.