Recently, Intel researchers unveiled a new LLM inference solution with low latency and high throughput for Intel GPUs. They showed that their solution achieved up to 7x lower latency and up to 27x higher throughput than the standard HuggingFace implementation.
As LLMs continue to play a pivotal role across various industries, optimising their performance has become a critical focus, and Intel’s latest development promises to be a game-changer. Tackling the inherent complexity of LLMs, with their intricate model structures and autoregressive inference modes, the team presents an efficient alternative to standard inference pipelines.
One of the primary challenges the research team addresses is the intricate design of LLMs, characterised by complex model structures and extensive autoregressive operations. This complexity leads to massive memory access and hampers inference speed.
A simplified LLM decoder layer is at the heart of their solution, strategically designed to fuse data movement and element-wise operations. This fusion reduces memory access frequency and significantly lowers system latency, paving the way for faster and more efficient inference processes.
How is Intel pushing the boundaries?
Intel’s solution begins with a streamlined approach to the LLM decoder layer. The team successfully reduces memory access frequency by fusing data movement and element-wise operations, substantially lowering system latency.
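Intel has not published its kernel code in this announcement, but the effect of fusing a chain of element-wise operations can be sketched in plain PyTorch. The example below is illustrative only: the residual-add-plus-RMSNorm step and the use of torch.compile as a stand-in for hand-written kernel fusion are assumptions, not Intel’s actual decoder layer.

```python
import torch

def naive_residual_rmsnorm(x, residual, weight, eps=1e-6):
    # Unfused: each step materialises an intermediate tensor in device memory,
    # so the data makes several round trips between memory and compute units.
    h = x + residual                                # element-wise add
    variance = h.pow(2).mean(-1, keepdim=True)      # reduction
    h_norm = h * torch.rsqrt(variance + eps)        # element-wise scale
    return h_norm * weight                          # element-wise multiply

@torch.compile  # ask the compiler to fuse the element-wise chain into one kernel
def fused_residual_rmsnorm(x, residual, weight, eps=1e-6):
    h = x + residual
    return h * torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps) * weight
```

In the unfused version every intermediate is written out and read back; fusing the chain lets the whole sequence run with roughly one read and one write per element, which is the memory-access saving the researchers target.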
Another key innovation is the introduction of a segment KV (key/value) cache policy. Keeping the keys and values of request (prompt) tokens and response (generated) tokens in distinct physical memory segments proves instrumental for effective device memory management. The outcome is a larger runtime batch size and improved overall system throughput.
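A minimal sketch of the idea, assuming a single request and hypothetical class and method names (SegmentKVCache, prefill, append, keys_values), might look like the following. It keeps the prompt segment and the generated-token segment in separate buffers, but it is not Intel’s implementation.

```python
import torch

class SegmentKVCache:
    """Toy KV cache that stores prompt ("request") and generated ("response")
    keys/values in separate buffers, loosely following the segment-cache idea."""

    def __init__(self, num_heads, head_dim, max_new_tokens, device="cpu"):
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.max_new_tokens = max_new_tokens
        self.device = device
        self.prompt_k = self.prompt_v = None   # fixed segment, filled once at prefill
        self.resp_k = self.resp_v = None       # pre-allocated segment for decoded tokens
        self.resp_len = 0

    def prefill(self, k, v):
        # k, v: [num_heads, prompt_len, head_dim] for the request tokens
        self.prompt_k, self.prompt_v = k, v
        shape = (self.num_heads, self.max_new_tokens, self.head_dim)
        self.resp_k = torch.empty(shape, device=self.device, dtype=k.dtype)
        self.resp_v = torch.empty(shape, device=self.device, dtype=v.dtype)
        self.resp_len = 0

    def append(self, k_new, v_new):
        # k_new, v_new: [num_heads, 1, head_dim] for the latest decoded token
        self.resp_k[:, self.resp_len : self.resp_len + 1] = k_new
        self.resp_v[:, self.resp_len : self.resp_len + 1] = v_new
        self.resp_len += 1

    def keys_values(self):
        # Concatenate both segments for attention; a real kernel would read them in place.
        k = torch.cat([self.prompt_k, self.resp_k[:, : self.resp_len]], dim=1)
        v = torch.cat([self.prompt_v, self.resp_v[:, : self.resp_len]], dim=1)
        return k, v
```

Because the response segment is pre-allocated separately from the prompt segment, finished responses can be freed without fragmenting the prompt storage, which is what allows more requests to be batched at once.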
To complement this, the team customises a Scaled-Dot-Product-Attention kernel so that it aligns with the fusion policy built on the segment KV cache. The result is a finely tuned LLM inference solution that promises to reshape the efficiency standards for these powerful language models.
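Building on the toy cache above, a single decode-step attention over both segments can be sketched with PyTorch’s built-in scaled_dot_product_attention. Intel’s actual kernel operates at a much lower level and reads the segments in place, so treat this purely as an illustration.

```python
import torch
import torch.nn.functional as F

def sdpa_over_segments(q, cache):
    """Attention for one decode step over a SegmentKVCache (see sketch above)."""
    k, v = cache.keys_values()        # [num_heads, prompt_len + resp_len, head_dim]
    # q: [num_heads, 1, head_dim]; no causal mask is needed for a single new token.
    return F.scaled_dot_product_attention(q, k, v)

# Hypothetical usage with random tensors standing in for real projections:
heads, dim = 8, 64
cache = SegmentKVCache(heads, dim, max_new_tokens=128)
cache.prefill(torch.randn(heads, 16, dim), torch.randn(heads, 16, dim))
cache.append(torch.randn(heads, 1, dim), torch.randn(heads, 1, dim))
out = sdpa_over_segments(torch.randn(heads, 1, dim), cache)  # shape [8, 1, 64]
```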
The research team has not only conceptualised these innovations but also translated them into practice: the LLM inference solution is implemented on Intel GPUs and is now publicly available for scrutiny and use.
The substantial reduction in token latency enhances system responsiveness, making it an ideal fit for applications where quick processing is crucial. Simultaneously, the significant boost in throughput facilitates the swift execution of larger tasks, making this solution particularly attractive for real-world, high-demand scenarios.