Across the spectrum, from cloud service providers to AI labs and startups, there is fervent demand for access to cutting-edge graphics processing units (GPUs). The market, however, is grappling with a shortage of these high-end chips, supply of which is dominated by a single company, NVIDIA, and the heightened demand from enterprises has sent prices soaring.
This is a challenge for the industry, with many opining that the shortage could even stifle AI innovation. Hence, what the industry needs is competition. While NVIDIA, the pioneer in this field, remains the leader, other GPU vendors such as Intel and AMD are making great strides and closing the gap with NVIDIA. However, choosing between multiple GPU vendors remains a complex task.
Need for an open software architecture
If your software or applications are optimised for one vendor’s GPUs, transitioning from one vendor’s GPUs to another’s may be challenging without significant code modifications and testing. Additionally, GPU drivers and application programming interfaces (APIs) are vendor-specific. Applications written against a vendor-specific API, such as CUDA, may not be compatible with GPUs from other vendors without significant modification. This can result in vendor lock-in, where switching GPUs becomes complex and costly.
Moreover, GPU vendors often provide software development kits (SDKs), libraries, and tools tailored to their GPUs. Developers may rely on these vendor-specific software components for tasks like GPU programming (e.g., CUDA for NVIDIA GPUs). Hence, switching to GPUs from a different vendor may require rewriting or adapting software to work with that vendor’s software stack.
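The lock-in problem, and the kind of decoupling an open architecture would provide, can be sketched in a few lines. The snippet below is a hypothetical illustration, not real vendor code: the class and function names are invented for this example, and the backends only return strings where real code would call the CUDA runtime or HIP. The point it shows is structural — application logic that codes against a vendor-neutral interface confines porting work to one backend module, instead of spreading vendor-specific calls through the whole codebase.

```python
# Hypothetical sketch: isolating vendor-specific GPU calls behind a
# neutral interface, so swapping vendors means swapping one backend.
from abc import ABC, abstractmethod

class GpuBackend(ABC):
    """Vendor-neutral interface the application codes against."""
    @abstractmethod
    def allocate(self, nbytes: int) -> str: ...

class CudaBackend(GpuBackend):
    def allocate(self, nbytes: int) -> str:
        # In real code this would call the CUDA runtime (e.g. cudaMalloc).
        return f"cuda buffer of {nbytes} bytes"

class RocmBackend(GpuBackend):
    def allocate(self, nbytes: int) -> str:
        # In real code this would call HIP (e.g. hipMalloc).
        return f"rocm buffer of {nbytes} bytes"

def run_workload(backend: GpuBackend) -> str:
    # Application logic never names a vendor; switching GPUs means
    # passing in a different backend, not rewriting this function.
    return backend.allocate(1024)

print(run_workload(CudaBackend()))  # cuda buffer of 1024 bytes
print(run_workload(RocmBackend()))  # rocm buffer of 1024 bytes
```

In practice this is roughly the role frameworks and open standards play: the application targets the common layer, and each vendor supplies the backend underneath it.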
Hence, an open software architecture that facilitates the selection of GPU providers would undeniably be a significant boon to the AI community, Mohammed Imran K R, chief technology officer at E2E Networks, told AIM. Besides making it easier to choose between GPU vendors, an open software architecture would eliminate the constraints of long-term vendor lock-in, allowing AI researchers and developers to choose GPUs based on their specific requirements.
“It would also lead to a more competitive environment, pushing GPU manufacturers to innovate and offer better hardware options for AI workloads. It would also drive cost efficiency, as organisations could select GPUs based on both performance and cost, thus optimising their resources,” he said.
Furthermore, an open infrastructure would encourage collaboration within the AI community. Standardised tools and interfaces would make it easier for developers and researchers to work with different GPU platforms, potentially accelerating advancements in AI technology. “Additionally, this approach aligns with industry trends favouring open-source solutions and interoperability, empowering companies to construct adaptable technology stacks,” Shivam Arora, marketing manager at Compunnel, told AIM.
Nonetheless, it is also essential to consider that developing and maintaining such an open infrastructure would require coordination between GPU vendors, software developers and the AI community. “While flexibility will be derived, performance optimisation could be an issue,” Sanjay Lodha, chairman and managing director of Netweb Technologies, told AIM.
OpenCL, ROCm and oneAPI
One could argue that OpenCL is one such open software architecture that already exists. Launched in 2009 by Apple and the Khronos Group to offer a standard for heterogeneous computing, OpenCL might be a viable option, but it does come with its own set of challenges. OpenCL allows you to write programs that can be executed on various GPU architectures from different vendors. “Even though OpenCL is gaining traction, it is still limited and may not provide the same level of optimisation as a vendor-specific tool like CUDA from NVIDIA,” Lodha said.
From an AI technology development standpoint, OpenCL currently has several drawbacks when compared to CUDA, with one critical aspect being that the majority of the latest research, models, and frameworks assume CUDA as the default GPU programming platform. “Additionally, achieving true cross-vendor portability can be challenging with OpenCL, as different GPU manufacturers implement it with varying degrees of compliance and performance,” Imran said.
In fact, a study comparing CUDA programs with OpenCL on NVIDIA GPUs showed that CUDA was 30% faster than OpenCL. Simultaneously, AMD’s ROCm, another alternative to CUDA, is making great strides. Interestingly, CUDA code can be converted to ROCm code using the HIP (Heterogeneous-Computing Interface for Portability) tools provided by AMD. Another interesting development of the last few years is oneAPI. While HIP code can target both AMD and NVIDIA GPUs, oneAPI applications can run on GPUs from Intel, NVIDIA and AMD, making both viable options.
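AMD’s actual conversion tools (hipify-perl and hipify-clang) parse the source properly, but the core idea of the CUDA-to-HIP conversion can be sketched as a source-to-source rename of CUDA runtime calls to their HIP counterparts. In the toy sketch below, the name pairs in the mapping are real CUDA/HIP API names, while the `hipify` function itself is a simplified stand-in for illustration, not the real tool:

```python
# Simplified sketch of what AMD's hipify tools do: rewrite CUDA runtime
# API names to their HIP equivalents. The real tools parse the code
# rather than doing plain text substitution; this is only an illustration.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def hipify(source: str) -> str:
    """Toy source-to-source translation of CUDA calls to HIP calls."""
    for cuda_name, hip_name in CUDA_TO_HIP.items():
        source = source.replace(cuda_name, hip_name)
    return source

cuda_snippet = "cudaMalloc(&d_ptr, n); cudaFree(d_ptr);"
print(hipify(cuda_snippet))  # hipMalloc(&d_ptr, n); hipFree(d_ptr);
```

Because HIP mirrors the CUDA runtime API almost one-to-one, much of a real port is exactly this kind of mechanical rename, which is what makes HIP a comparatively low-friction exit from CUDA.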
Moving away from CUDA
While enterprises might look into alternative vendors like Intel, AMD or even China-based Huawei, moving the industry away from CUDA could be challenging. “It involves rewriting or adapting existing code, potentially causing disruptions and requiring significant retraining of developers. However, the industry’s increasing interest in open-source alternatives indicates a growing willingness to embrace change. The difficulty of this transition ultimately depends on the specific needs and objectives of the company and its commitment to open-source principles,” Arora said.
Lodha, on the other hand, is a bit more sceptical. He believes it will be immensely difficult for the AI community to move away from CUDA towards a more open software architecture because many machine learning models have been built on CUDA code. “This means that researchers and developers would need to rewrite their code in order to use a different GPU programming framework.”
Nonetheless, he believes that the benefits of moving to a more open-source GPU programming framework outweigh the costs. He further stated that an open-source framework would make it easier for researchers and developers to compare the performance of different GPUs and to choose the GPU that is best suited for their needs. It would also make it easier for vendors to compete with each other, which would lead to lower prices and better products.
“I think the best way to move from CUDA to a more open-source alternative is to transition gradually. Researchers and developers could start by writing new code in an open-source framework, such as OpenCL or ROCm. They could also start porting existing CUDA code to an open-source framework. There are already tools being used for this, but programming effort is still required.”
Imran also concurs. He thinks ensuring compatibility with other components of the software stack and achieving true cross-vendor portability is challenging at the moment. “However, in the long run, we believe that there will be alternatives and there are compelling reasons for it, including reducing vendor lock-in, promoting interoperability, and contributing to a more diverse GPU ecosystem.”