Last night Meta released Llama 2, an upgraded version of its large language model LLaMA, in a surprise partnership with Microsoft. Soon to be available in the Microsoft Azure model catalogue and on Amazon SageMaker, the model can be used for both research and commercial purposes under Meta’s license.
The 7B, 13B, and 70B parameter models, released in both pre-trained and fine-tuned variants, were trained on 40% more data than the original LLaMA, with double the context length, and the largest model employs GQA (grouped-query attention) to speed up inference.
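At a high level, grouped-query attention lets several query heads share a single key/value head, shrinking the key/value cache that must be kept in memory during generation and so speeding up inference. Here is a minimal PyTorch sketch of the idea, not Meta’s implementation; the shapes and names are illustrative:

```python
import math
import torch

def grouped_query_attention(q, k, v):
    """Toy GQA: q is (batch, n_q_heads, seq, dim); k and v are
    (batch, n_kv_heads, seq, dim) with n_kv_heads < n_q_heads.
    Each group of query heads attends using one shared K/V head."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    # Expand each K/V head so that `group` query heads share it.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v
```

With n_kv_heads equal to n_q_heads this reduces to ordinary multi-head attention; with a single K/V head it becomes multi-query attention, and GQA sits in between.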
Meanwhile, over the past couple of months, several companies have launched their own LLMs, including TII’s Falcon, Stanford’s Alpaca, LMSYS’s Vicuna-13B, Anthropic’s Claude 2 and more. So before your timeline gets flooded with posts like “ChatGPT is just the tip of the iceberg, Llama is here” or “Meta is Microsoft’s new favourite child”, let’s cut to the chase and see how these models fare.
Grades Matter
Llama 2-Chat was built with supervised fine-tuning and reinforcement learning from human feedback (RLHF), a process that involved collecting preference data and training reward models, along with a new technique called Ghost Attention (GAtt) that helps the model follow system instructions across multiple dialogue turns. Meta also used GPT-4 as one of the judges in its evaluations. To gauge helpfulness, Meta ran a human study over roughly 4,000 prompts, comparing models with a “win rate” metric similar to the one used by the Vicuna benchmark. The study pits Llama 2-Chat against both open-source and closed-source models like ChatGPT and PaLM, on single- and multi-turn prompts.
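The reward models at the heart of that pipeline are commonly trained with a pairwise ranking loss over the preference data: the score of the response annotators preferred is pushed above the score of the rejected one, and the Llama 2 paper additionally adds a margin that grows with how decisively annotators preferred it. A hedged sketch; the function and tensor names here are ours, not Meta’s:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Binary ranking loss for reward-model training: rewards of
    preferred responses should exceed those of rejected ones by at
    least `margin` (the Llama 2 paper scales the margin by how
    strongly annotators preferred the chosen response)."""
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Toy usage with scalar rewards for a batch of two preference pairs.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, 0.5])
print(reward_ranking_loss(chosen, rejected, margin=0.5))
```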
The 70B Llama 2 model performs roughly on par with GPT-3.5-0301 and outperforms Falcon, MPT, and Vicuna. The Llama 2-Chat models beat open-source models on helpfulness for both single- and multi-turn prompts: against ChatGPT, the 70B model records a win rate of 36% and a tie rate of 31.5%, and it beats the MPT-7B-chat model on 60% of the prompts. The Llama 2-Chat 34B model has an overall win rate of over 75% against the similarly sized Vicuna-33B and Falcon-40B models. Additionally, the 70B model outperforms the PaLM-Bison chat model by a significant margin.
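Mechanically, those numbers come from a simple tally: for every prompt, raters judge whether one model’s response wins, ties, or loses against the other’s, and the rates are just the fractions of each label. A toy illustration (the helper and labels are hypothetical):

```python
from collections import Counter

def win_tie_rates(judgments):
    """judgments: per-prompt labels ('win', 'tie', 'loss') from raters
    comparing model A's responses against model B's."""
    counts = Counter(judgments)
    return {label: counts[label] / len(judgments)
            for label in ("win", "tie", "loss")}

# e.g. Llama 2-Chat 70B vs ChatGPT lands near
# {'win': 0.36, 'tie': 0.315, 'loss': 0.325} on Meta's ~4,000 prompts.
```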
However, Llama 2 is weak at coding.
On HumanEval, it does not reach GPT-3.5 (48.1) or GPT-4 (67.0) levels. Although its MMLU (Massive Multitask Language Understanding) scores are good, its HumanEval results show coding capability quite a bit lower than StarCoder’s (33.6) and that of many other models designed specifically for coding. But, considering that Llama 2 comes with open weights, it is highly likely to improve significantly over time.
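For context, HumanEval numbers like those above are pass@k percentages: the model writes n candidate programs per problem, c of them pass the unit tests, and the unbiased estimator from the original Codex paper gives the probability that at least one of k randomly drawn candidates passes. A sketch of that estimator:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the Codex/HumanEval paper:
    1 - C(n - c, k) / C(n, k), i.e. one minus the probability that
    all k sampled candidates come from the n - c failing ones."""
    if n - c < k:
        return 1.0  # too few failures to fill a sample of size k
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A reported score of 33.6 corresponds to pass@1 = 0.336 averaged
# over HumanEval's 164 problems.
```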
On the other hand, Claude 2 excels in coding, mathematics, and logical reasoning, and can even comprehend PDFs, a task that GPT-4 still struggles with. It attained an impressive score of 71.2% on Codex HumanEval, an evaluation specifically designed to assess Python coding skills.
When it comes to writing, Llama 2 and GPT-4 are very different, too.
When asked to write a poem, the two took different approaches. ChatGPT’s word choices seem more intentional and more attentive to how words sound, like a sophisticated poet with a wider vocabulary, while Llama 2 opts for more obvious rhymes, like a high-school poem.
Even though Llama 2 is trained at a much smaller scale, its output is commendable, according to several users with beta access. For fine-tuning, Meta initially used publicly available data but, finding it insufficient, collected its own high-quality annotations and achieved better results with fewer examples. It also studied how different annotation platforms and vendors affected performance, and found the model’s outputs comparable to human annotations.
Open Source Or Openness?
Building LLaMA likely cost Meta over USD 20 million. And although Llama 2 is being touted as open source, its commercially friendly license comes with a condition.
As per the license, any company with over 700 million monthly active users must request permission from Meta to use the model, and whether to grant access is entirely at Meta’s discretion. To sum up, it is “free for everyone except FAANG”, as many have put it.
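In code terms, the licensing gate is a single threshold check; a toy illustration (per the license text, the threshold is measured against the calendar month preceding Llama 2’s release):

```python
def needs_meta_permission(monthly_active_users: int) -> bool:
    """Llama 2 community license: products or services with more than
    700 million monthly active users must request a separate license
    from Meta; everyone else can use the weights commercially."""
    return monthly_active_users > 700_000_000
```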
However, other LLMs like GPT-4 and Claude 2 are not open source at all and can only be accessed through APIs.
Microsoft’s Second Child
Microsoft’s new partnership with Meta came as a surprise. After investing in a ten-year partnership with OpenAI, Satya Nadella seems to yearn for more. Meanwhile, Meta’s Threads managed to amass a staggering 10 million registrations within a mere seven hours of its debut, while ChatGPT’s traffic saw an unprecedented decline of 9.7% in June, the first downturn since the chatbot’s launch last November.
When OpenAI released the GPT-4 technical report, the ChatGPT maker received immense flak because the paper lacked crucial details about the architecture, model size, hardware, training compute, dataset construction, and training method. Researchers argued that OpenAI’s approach undermines the principles of disclosure, perpetuates biases, and fails to establish the validity of GPT-4’s performance on human exams.
On the other hand, Meta’s white paper is itself a masterpiece. It spelt out the entire recipe, including model details, training stages, hardware, data pipeline, and annotation process. For example, there’s a systematic analysis of the effect of RLHF with nice visualisations.
According to Percy Liang, director of Stanford’s Center for Research on Foundation Models, Llama 2 poses a considerable threat to OpenAI. Meta’s research paper itself admits there is still a large gap in performance between Llama 2 and GPT-4. So even though Llama 2 can’t compete with GPT-4 on every parameter, it has the potential to close that gap over time. “To have Llama-2 become the leading open-source alternative to OpenAI would be a huge win for Meta,” says Steve Weber, a professor at the University of California, Berkeley.
Thus, with the arrival of Meta’s Llama 2, Microsoft now has a new child to rely upon should its older child fail.
Read more: Claude-2 Vs GPT-4