Researchers at NVIDIA have developed Llama3-ChatQA-2-70B, a new large language model that rivals GPT-4-Turbo in handling long contexts up to 128,000 tokens and excels in retrieval-augmented generation (RAG) tasks.
The model, based on Meta’s Llama3, demonstrates competitive performance across various benchmarks, including long-context understanding, medium-length tasks, and short-context evaluations.
Key highlights of Llama3-ChatQA-2-70B include a 128,000-token context window that matches GPT-4-Turbo, stronger performance than GPT-4-Turbo on RAG tasks, and competitive results on long-context benchmarks that extend beyond 100,000 tokens.
Additionally, the model performs strongly on medium-length tasks within 32,000 tokens and maintains effectiveness on short-context tasks within 4,000 tokens.
The researchers employed a two-step approach to extend Llama3-70B’s context window from 8,000 to 128,000 tokens. This involved continued pre-training on a mix of SlimPajama data with upsampled long sequences, followed by a three-stage instruction tuning process.
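To make the first step concrete, here is a minimal sketch of long-sequence upsampling for continued pre-training. It is not the authors' released code: the 8,000-token threshold, the upsampling factor, and the toy tokenizer are illustrative assumptions, and the paper's actual data mixture and training recipe should be taken from the paper itself.

```python
# Hypothetical sketch: upsample long documents so they are better represented
# in the continued pre-training mix. Threshold and factor are illustrative.
import random

def build_pretraining_mix(documents, tokenizer, long_threshold=8_000,
                          long_upsample_factor=4, seed=0):
    """Return a shuffled document list in which long documents are repeated
    (upsampled) relative to short ones."""
    rng = random.Random(seed)
    mix = []
    for doc in documents:
        n_tokens = len(tokenizer(doc))
        copies = long_upsample_factor if n_tokens >= long_threshold else 1
        mix.extend([doc] * copies)
    rng.shuffle(mix)
    return mix

# Toy usage with a whitespace "tokenizer":
docs = ["short text"] * 5 + [" ".join(["token"] * 9_000)]
mix = build_pretraining_mix(docs, tokenizer=str.split)
print(len(mix))  # 9: the single long document contributes 4 copies
```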
Evaluation results show that Llama3-ChatQA-2-70B outperforms many existing state-of-the-art models, including GPT-4-Turbo-2024-04-09, on the InfiniteBench long-context tasks. The model achieved an average score of 34.11, compared to GPT-4-Turbo’s 33.16.
For medium-length tasks within 32,000 tokens, Llama3-ChatQA-2-70B scored 47.37, surpassing some competitors but falling short of GPT-4-Turbo’s 51.93. On short-context tasks, the model achieved an average score of 54.81, outperforming GPT-4-Turbo and Qwen2-72B-Instruct.
The study also compared RAG and long-context solutions, finding that RAG outperforms full long-context solutions for tasks beyond 100,000 tokens. This suggests that even state-of-the-art long-context models may struggle to effectively understand and reason over such extensive inputs.
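For readers unfamiliar with the RAG setup being compared, the sketch below shows the general idea: instead of feeding the model the full 100,000+ token input, the document is chunked, the chunks most relevant to the query are retrieved, and only those are placed in the prompt. The chunk size, top-k value, and simple bag-of-words scorer here are assumptions for illustration, not the retriever or parameters used in the paper.

```python
# Generic RAG sketch: chunk a long document, score chunks against the query,
# and build a prompt from the top-k chunks only. Scorer and sizes are illustrative.
from collections import Counter
import math

def chunk_text(text, chunk_size=300):
    """Split text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def cosine_sim(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_top_k(query, chunks, k=5):
    """Rank chunks by similarity to the query and keep the top k."""
    q_vec = Counter(query.lower().split())
    scored = [(cosine_sim(q_vec, Counter(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def build_rag_prompt(query, document, k=5):
    """Assemble a prompt containing only the retrieved context."""
    context = "\n\n".join(retrieve_top_k(query, chunk_text(document), k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

In practice the scorer would be a learned retriever or embedding model rather than word overlap, but the pipeline shape is the same: the quality of retrieval, rather than raw context length, drives performance on these very long inputs.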
This development represents a significant step forward in open-source language models, bringing them closer to the capabilities of proprietary models like GPT-4. The researchers have provided detailed technical recipes and evaluation benchmarks, contributing to the reproducibility and advancement of long-context language models in the open-source community.