
Intel Unveils New Low-Latency LLM Inference Solution Optimized for Intel GPUs

As LLMs continue to play a pivotal role across various industries, optimising their performance has become a critical focus


Recently, Intel researchers unveiled a new LLM inference solution with low latency and high throughput for Intel GPUs. They showed that their solution achieves up to 7x lower latency and up to 27x higher throughput than a standard HuggingFace implementation.
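For context, the HuggingFace baseline that such comparisons are typically measured against is the stock transformers generation loop, which runs one full forward pass per generated token. A minimal sketch (the model name and decoding settings here are illustrative, not the configuration Intel benchmarked):

```python
# Plain HuggingFace generation baseline (illustrative only; the exact model
# and decoding settings Intel compared against are not reproduced here).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical model choice

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt")
# Standard autoregressive decoding: one forward pass per new token.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```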

As LLMs continue to play a pivotal role across various industries, optimising their performance has become a critical focus, and Intel’s latest development promises to be a game-changer. Tackling the inherent complexity of LLMs, the team behind this work presents an efficient alternative to standard inference pipelines.

One of the primary challenges the research team addresses is the intricate design of LLMs: complex model structures combined with extensive autoregressive operations. This combination results in massive memory access and hampers inference speed.

A simplified LLM decoder layer is at the heart of their solution, strategically designed to fuse data movement and element-wise operations. This fusion reduces memory access frequency and significantly lowers system latency, paving the way for faster and more efficient inference processes.


How is Intel pushing the boundaries?

Intel’s solution begins with a streamlined LLM decoder layer. By fusing data movement and element-wise operations, the team reduces how often the GPU has to touch memory, substantially lowering system latency; the sketch below illustrates the idea.
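Intel’s fused kernels are hand-written for its GPUs and are not reproduced here, but the principle can be sketched in plain PyTorch. In a decoder layer, a residual add feeding a normalisation is a chain of element-wise passes over the same activation tensor; fusing that chain into one kernel avoids writing and re-reading intermediates. A hedged sketch using torch.compile (PyTorch 2.x) as a stand-in for hand-written fusion:

```python
# Sketch of element-wise fusion (not Intel's actual kernels). Unfused, each
# step below reads and writes the full activation tensor from memory.
import torch

def add_rmsnorm(x, residual, weight, eps=1e-6):
    h = x + residual                            # memory pass 1
    var = h.pow(2).mean(-1, keepdim=True)       # memory pass 2
    return h * torch.rsqrt(var + eps) * weight  # memory pass 3 (RMSNorm)

# torch.compile can fuse such point-wise chains into fewer kernels, cutting
# memory traffic; Intel's solution applies the same idea, plus fusion of
# data movement, in hand-tuned GPU kernels.
fused_add_rmsnorm = torch.compile(add_rmsnorm)

x, residual = torch.randn(1, 4096), torch.randn(1, 4096)
weight = torch.ones(4096)
out = fused_add_rmsnorm(x, residual, weight)
```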

Another key innovation is a segment KV (key/value) cache policy. Keys and values for request (prompt) tokens and response (generated) tokens are stored in separate physical memory segments, which makes device memory management far more effective. The outcome is a larger runtime batch size and higher overall system throughput.
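The paper’s exact memory layout is not detailed here, but the policy can be pictured as a small data structure: the request segment is written once at prefill time, while the response segment grows token by token in its own buffer. A hypothetical sketch (names and shapes are illustrative):

```python
# Hypothetical segment KV cache: request (prompt) and response (generated)
# keys/values live in separate physical buffers (illustrative sketch).
import torch

class SegmentKVCache:
    def __init__(self, num_heads, head_dim, max_resp_len, dtype=torch.float16):
        self.prompt_k = None  # set once after the prefill pass
        self.prompt_v = None
        # Response segment pre-allocated up to a cap, appended to per step.
        self.resp_k = torch.empty(num_heads, max_resp_len, head_dim, dtype=dtype)
        self.resp_v = torch.empty_like(self.resp_k)
        self.resp_len = 0

    def set_prompt(self, k, v):
        # k, v: [num_heads, prompt_len, head_dim]
        self.prompt_k, self.prompt_v = k, v

    def append_response(self, k, v):
        # k, v: [num_heads, head_dim] for the newest token
        self.resp_k[:, self.resp_len] = k
        self.resp_v[:, self.resp_len] = v
        self.resp_len += 1

    def segments(self):
        # Attention reads each segment in place; no concatenation needed.
        return [(self.prompt_k, self.prompt_v),
                (self.resp_k[:, : self.resp_len], self.resp_v[:, : self.resp_len])]
```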

To complement this, the team customises a scaled-dot-product attention (SDPA) kernel so that it works directly with the fusion policy and the segment KV cache. The result is a finely tuned LLM inference solution that promises to reshape the efficiency standards for these powerful language models.
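The kernel itself is Intel GPU code, but the behaviour it needs, attending over the prompt and response segments without first copying them into one contiguous buffer, can be sketched at a high level in PyTorch (a hypothetical helper, not the paper’s implementation):

```python
# Sketch of scaled-dot-product attention over a segmented KV cache
# (illustrative; not Intel's kernel).
import math
import torch

def sdpa_over_segments(q, segments):
    # q: [num_heads, 1, head_dim] for one decode step.
    # segments: list of (k, v) pairs, each [num_heads, seg_len, head_dim].
    scale = 1.0 / math.sqrt(q.shape[-1])
    # Per-segment scores, then a single softmax across all segments.
    scores = [torch.matmul(q, k.transpose(-2, -1)) * scale for k, _ in segments]
    probs = torch.softmax(torch.cat(scores, dim=-1), dim=-1)
    out, offset = 0, 0
    for (_, v), s in zip(segments, scores):
        seg_len = s.shape[-1]
        out = out + torch.matmul(probs[..., offset:offset + seg_len], v)
        offset += seg_len
    return out  # [num_heads, 1, head_dim]
```

Given a cache like the SegmentKVCache sketched above, a decode step would call sdpa_over_segments(q, cache.segments()), letting each segment stay in its own buffer.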

The research team has not only conceptualised these innovations but also translated them into practice: the solution is implemented on Intel GPUs and is now publicly available for scrutiny and use.

The substantial reduction in token latency makes the system more responsive, an ideal fit for applications where quick processing is crucial. At the same time, the significant boost in throughput allows larger tasks to be executed swiftly, making the solution particularly attractive for real-world, high-demand scenarios.



Sandhra Jayan

Sandhra Jayan is an enthusiastic tech journalist with a flair for uncovering the latest trends in the AI landscape. Known for her compelling storytelling and insightful analysis, she transforms complex tech narratives into captivating, accessible content. Reach out to her at sandhra.jayan@analyticsindiamag.com