As the industry buzzes about Meta’s latest Llama 3.1 405B release, Groq has announced a partnership with Meta, making the latest Llama 3.1 models—including 405B Instruct, 70B Instruct, and 8B Instruct—available to the community at Groq speed.
“I’m really excited to see Groq’s ultra-low-latency inference for cloud deployments of the Llama 3.1 models… By making our models and tools available to the community, companies like Groq can build on our work and help push the whole ecosystem forward,” said Meta chief Mark Zuckerberg.
“Meta is developing an open operating system for AI akin to Linux—not just for the Groq LPU, which offers rapid AI inference, but for the entire ecosystem,” said Groq chief Jonathan Ross. He added that Meta’s models have caught up with the leading proprietary models, and it’s only a matter of time before they surpass the closed ones.
Former OpenAI researcher Andrej Karpathy praised Groq’s inference speed, saying, “This is so cool. It feels like AGI—you just talk to your computer and it does stuff instantly. Speed really makes AI so much more pleasing.”
He added that Groq’s new chip, the LPU, runs LLM inference remarkably fast. “They’ve already integrated Llama 3.1 models and appear to be able to inference the 8B model ~instantly,” he said. However, due to high demand, he wasn’t able to try it out.
“And (I can’t seem to try it due to server pressure) the 405B running on Groq is probably the highest capability, fastest LLM today (?).”
In the past few months, Groq has captured attention with its promise to perform AI tasks faster and more cost-effectively than its competitors. This can be attributed to its language processing unit (LPU), which handles such tasks more efficiently than GPUs thanks to its linear, streaming mode of operation.
While GPUs are crucial for model training, AI applications in deployment—referred to as “inference”—require greater efficiency and lower latency.
Groq Speed
“Groq is incredibly fast, currently up to 1200+ tokens/second. But what can you do with that speed?” said Benjamin Klieger, AI Applications Engineer at Groq, introducing StockBot—a lightning-fast, open-source AI chatbot powered by Llama 3 70B on Groq that responds with live stock charts, financials, news, and screeners.
While GPT-4o mini is much faster than the median provider of Llama 3 70B, Groq serves Llama 3 70B at ~340 output tokens/second, over 2X faster than GPT-4o mini, according to a report by Artificial Analysis.
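Output tokens per second is typically measured by streaming a completion and timing the generated tokens. A minimal sketch of such a measurement against Groq, assuming the `groq` Python SDK and the `llama3-70b-8192` model ID (both assumptions, not stated in the article):

```python
import os
import time

from groq import Groq  # assumes the `groq` Python SDK (pip install groq)

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.time()
chunk_count = 0

# Stream a completion and count content chunks as a rough proxy for output tokens.
stream = client.chat.completions.create(
    model="llama3-70b-8192",  # assumed Groq model ID for Llama 3 70B
    messages=[{"role": "user", "content": "Summarise the history of the transistor."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunk_count += 1

elapsed = time.time() - start
print(f"~{chunk_count / elapsed:.0f} chunks/second (rough proxy for output tokens/second)")
```

Formal benchmarks such as Artificial Analysis’ exclude time to first token and use exact token counts, so a quick sketch like this will only give a ballpark figure.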
Rick Lamers, project lead at Groq, recently announced the Llama 3 Groq Tool Use models in 8B and 70B versions.
He shared on X that these models, which are open-source and fully fine-tuned for tool use, have achieved the top spot on the Berkeley Function Calling Leaderboard (BFCL), surpassing all other models, including proprietary ones like Claude 3.5 Sonnet, GPT-4 Turbo, GPT-4o, and Gemini 1.5 Pro.
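Tool use here means the model returns structured function calls instead of free text. A minimal sketch of how that looks against Groq’s OpenAI-style tool-calling interface, assuming the `groq` SDK and the preview model ID `llama3-groq-70b-8192-tool-use-preview` (the tool itself is hypothetical, for illustration only):

```python
import json
import os

from groq import Groq  # assumes the `groq` Python SDK

client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Describe a tool the model may call, using an OpenAI-style JSON schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",  # hypothetical tool for illustration
        "description": "Get the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3-groq-70b-8192-tool-use-preview",  # assumed model ID
    messages=[{"role": "user", "content": "What is NVDA trading at?"}],
    tools=tools,
    tool_choice="auto",
)

# Instead of prose, the model emits structured tool calls for the application to execute.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```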
In the 16 weeks since its launch, Groq has offered its service to power LLM workloads for free, drawing significant uptake from developers, who now number over 282,000, according to Ross.
“It’s really easy to use and doesn’t cost anything to get started. You just use our API, and we’re compatible with most applications that have been built,” said Ross. He added that if any customer has a large-scale requirement and is generating millions of tokens per second, the company can deploy hardware for the customer on-premises.
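The compatibility Ross refers to comes from Groq exposing an OpenAI-compatible endpoint, so many existing applications can switch by changing only the base URL and API key. A minimal sketch using the standard `openai` Python client (the Llama 3.1 model ID shown is an assumption):

```python
import os

from openai import OpenAI  # the existing OpenAI client, repointed at Groq

# Point the client at Groq's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # assumed Groq model ID for Llama 3.1 8B Instruct
    messages=[{"role": "user", "content": "In one sentence, what is an LPU?"}],
)

print(response.choices[0].message.content)
```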
What’s the Secret?
Founded in 2016 by Ross, Groq distinguishes itself by eschewing GPUs in favour of its proprietary hardware, the LPU.
Prior to Groq, Ross worked at Google, where he created the tensor processing unit (TPU). He was responsible for designing and implementing the core elements of the original TPU chip, which played a pivotal role in Google’s AI efforts, including the AlphaGo competition.
LPUs are meant only to run LLMs, not to train them. “The LPUs are about 10 times faster than GPUs when it comes to inference or the actual running of the models,” said Ross, adding that training LLMs remains a task for GPUs.
When asked about the purpose of this speed, Ross said, “Human beings don’t like to read like this, as if something is being printed out like an old teletype machine. Eyes scan a page really quickly and figure out almost instantly whether or not they’ve got what they want.”
Groq’s LPU poses a significant challenge to traditional GPU manufacturers like NVIDIA, AMD, and Intel. Rather than adapting general-purpose processors for AI, Groq built its tensor streaming processor specifically to speed up deep learning computations.
The LPU is designed to overcome the two main LLM bottlenecks: compute density and memory bandwidth. For LLM workloads, an LPU offers greater compute capacity than a GPU or CPU, reducing the time taken to calculate each word and allowing text sequences to be generated much faster.
Additionally, eliminating external memory bottlenecks enables the LPU inference engine to deliver orders of magnitude better performance on LLMs compared to GPUs.
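To see why memory bandwidth dominates, note that generating a single response (batch size 1) requires streaming every model weight from memory once per new token. A back-of-envelope sketch, with bandwidth figures that are illustrative assumptions rather than vendor specifications:

```python
# Back-of-envelope: tokens/second ceiling when decoding is memory-bandwidth bound.
# At batch size 1, every weight must be read from memory once per generated token.

PARAMS = 70e9          # Llama 3 70B
BYTES_PER_PARAM = 2    # FP16 weights
weight_bytes = PARAMS * BYTES_PER_PARAM  # ~140 GB

# Illustrative memory bandwidths in GB/s (assumptions for the sketch, not vendor specs):
bandwidths_gb_s = {
    "GPU with off-chip HBM": 3_350,    # ~3.35 TB/s
    "LPU-style on-chip SRAM": 80_000,  # ~80 TB/s
}

for name, bw in bandwidths_gb_s.items():
    tokens_per_s = (bw * 1e9) / weight_bytes
    print(f"{name}: ~{tokens_per_s:.0f} tokens/s ceiling at batch size 1")
```

In practice Groq spreads a model of this size across many LPUs rather than one chip, but the arithmetic illustrates why keeping weights out of slower external memory lifts the ceiling on single-response generation speed.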
The LPU is designed to prioritise the sequential processing of data, which is inherent in language tasks. This contrasts with GPUs, which are optimised for parallel processing tasks such as graphics rendering.
“You can’t produce the 100th word until you’ve produced the 99th, so there is a sequential component to them that you simply can’t get out of a GPU,” said Ross.
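That sequential dependency is built into autoregressive decoding itself: each step consumes every token produced so far. A schematic sketch, where `next_token` is a hypothetical stand-in for a full forward pass of the model:

```python
from typing import Callable, List

def generate(
    prompt_tokens: List[int],
    next_token: Callable[[List[int]], int],  # hypothetical: one full forward pass
    max_new_tokens: int,
    eos_id: int,
) -> List[int]:
    """Autoregressive decoding: token N+1 cannot start until token N exists."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # depends on all previously generated tokens
        tokens.append(tok)
        if tok == eos_id:
            break
    return tokens
```

There is plenty of parallel matrix arithmetic inside each step, which GPUs handle well; it is the step-to-step chain that limits how quickly a single response can finish, and the latency of each step is what the LPU is built to minimise.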
GPUs are also notoriously power-hungry, Ross noted, often drawing as much power per chip as an average household. “LPUs use as little as a tenth as much power,” he said.