Last month, OpenAI released Triton 1.0, an open-source, Python-like programming language that enables researchers to write highly efficient graphics processing unit (GPU) code. OpenAI claims Triton delivers substantial ease-of-use benefits over coding in CUDA, NVIDIA’s parallel computing platform and programming model. The development repository for the Triton language and compiler is available on GitHub.
OpenAI scientist Philippe Tillet said the aim is for Triton to become a viable alternative to CUDA for deep learning. “It is for machine learning researchers and engineers who are unfamiliar with GPU programming despite having good software engineering skills,” he added.
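For a flavour of the language, below is a minimal vector-addition kernel modelled on Triton’s public tutorials; the kernel name and block size here are illustrative, not part of any official API beyond the triton and triton.language modules.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # index of this program instance
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```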
Today, several high-level programming languages and libraries offer access to the GPU for certain sets of problems and algorithms. In this article, we look at the alternatives to OpenAI Triton.
OpenACC
OpenACC is a user-driven, directive-based, performance-portable parallel programming model. It is designed for engineers and scientists interested in porting their codes to heterogeneous HPC hardware platforms and architectures with significantly less programming effort than a low-level model requires. It supports the C, C++ and Fortran programming languages and multiple hardware architectures, including x86 and POWER CPUs and NVIDIA GPUs.
While OpenACC offers a set of directives to execute code in parallel on the GPU, such high-level abstractions are only efficient for certain classes of problems and often unsuitable for nontrivial parallelisation or data movement.
CUDA
CUDA, short for Compute Unified Device Architecture, was developed by NVIDIA for general computing. This software layer gives direct access to the GPU’s virtual instruction set and parallel computational elements for the execution of compute kernels.
It is the leading proprietary framework for general-purpose computing on GPUs (GPGPU). GPGPU refers to the use of GPUs to perform computation traditionally handled by CPUs. Data flows in both directions, from CPU to GPU and back, improving efficiency in a range of workloads, especially image and video processing.
CUDA works with programming languages such as C, C++, and Fortran. It has applications in fields including life sciences, bioinformatics, computer vision, electrodynamics, computational chemistry, medical imaging, and finance.
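CUDA kernels are ordinarily written in C/C++; to keep this article’s examples in Python, here is a rough sketch of the same kernel-and-grid model through Numba’s CUDA bindings (Numba is a separate project, not part of the CUDA toolkit itself).

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(arr, factor):
    i = cuda.grid(1)            # absolute thread index across the grid
    if i < arr.size:            # guard against threads past the end
        arr[i] *= factor

data = np.arange(1024, dtype=np.float32)
d_data = cuda.to_device(data)   # host -> device copy
threads = 256
blocks = (data.size + threads - 1) // threads
scale[blocks, threads](d_data, 2.0)
result = d_data.copy_to_host()  # device -> host copy
```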
PyCUDA
PyCUDA gives Pythonic access to NVIDIA’s CUDA parallel computation API. Object cleanup is tied to the lifetime of objects (the RAII idiom), which makes it easier to write correct, leak-free code. PyCUDA knows about dependencies, too, so it won’t detach from a context before all memory allocated in it is freed. Abstractions like SourceModule and GPUArray make CUDA programming even more convenient than with NVIDIA’s C-based runtime.
PyCUDA ensures all CUDA errors are automatically translated into Python exceptions.
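A minimal sketch of this workflow, using the documented SourceModule and GPUArray abstractions; the kernel itself is illustrative.

```python
import numpy as np
import pycuda.autoinit                      # sets up a context; cleanup is automatic
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

# compile a CUDA C kernel at runtime
mod = SourceModule("""
__global__ void double_it(float *a)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    a[idx] *= 2.0f;
}
""")
double_it = mod.get_function("double_it")

a = gpuarray.to_gpu(np.random.randn(256).astype(np.float32))
double_it(a, block=(256, 1, 1), grid=(1, 1))
print(a.get()[:4])                          # copy the result back to the host

# GPUArray also supports NumPy-style arithmetic directly on the device
print((2 * a).get()[:4])
```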
OpenCL
Open Computing Language (OpenCL) is an open standard for writing code that runs across heterogeneous platforms, including CPUs, GPUs, digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. Notably, it gives applications access to GPUs for GPGPU, which in some cases yields significant speed-ups. In computer vision, for example, many algorithms run much more efficiently on a GPU than on a CPU, particularly in image processing, computational photography, object detection, and matrix arithmetic.
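Again keeping to Python, here is a minimal sketch through the PyOpenCL binding (a separate project that wraps the OpenCL standard); the kernel squares a vector in place, and the same code runs on any device with an OpenCL driver.

```python
import numpy as np
import pyopencl as cl

a = np.random.rand(1024).astype(np.float32)

ctx = cl.create_some_context()              # pick any available OpenCL device
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

prg = cl.Program(ctx, """
__kernel void square(__global float *a)
{
    int gid = get_global_id(0);
    a[gid] = a[gid] * a[gid];
}
""").build()

prg.square(queue, a.shape, None, a_buf)     # enqueue the kernel
out = np.empty_like(a)
cl.enqueue_copy(queue, out, a_buf)          # read the result back
```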
OpenPAI
Developed by Microsoft, OpenPAI offers complete AI model training and resource management capabilities. The open-source platform supports on-premise, cloud, and hybrid environments. More details are available in the OpenPAI GitHub repository.
CatBoost
Developed by Yandex researchers and engineers, CatBoost is an algorithm for gradient boosting on decision trees. It is used for search, recommendation systems, personal assistants, weather prediction, self-driving cars and more, and it supports computation on both CPU and GPU.
CatBoost delivers better quality than other GBDT libraries on many datasets, offers best-in-class prediction speed, supports both numerical and categorical features, provides fast GPU and multi-GPU training out of the box, and includes visualisation tools.
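As a minimal sketch, GPU training is enabled through the documented task_type parameter; the toy data below is made up purely for illustration.

```python
import numpy as np
from catboost import CatBoostClassifier

# toy data: one numerical and one categorical feature
rng = np.random.default_rng(0)
num = rng.normal(size=100)
cat = rng.choice(["a", "b", "c"], size=100)
X = [[float(n), str(c)] for n, c in zip(num, cat)]
y = (num > 0).astype(int)

model = CatBoostClassifier(
    iterations=50,
    task_type="GPU",    # requires a CUDA-capable GPU; omit to train on CPU
    devices="0",        # GPU device id(s)
    verbose=False,
)
model.fit(X, y, cat_features=[1])           # column 1 is categorical
print(model.predict([[0.5, "a"]]))
```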
TF Quant Finance
TF Quant Finance offers high-performance components leveraging the hardware acceleration support and automatic differentiation of TensorFlow.
The library provides TensorFlow support for foundational mathematical methods (optimisation, interpolation, root finders, linear algebra, etc.), mid-level methods (ODE and PDE solvers, an Ito process framework, diffusion path generators, etc.), and specific pricing models such as local volatility (LV), stochastic volatility (SV), stochastic local volatility (SLV), and Hull-White (HW).
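For instance, the library’s documented Black-Scholes pricer batch-prices European options; a minimal sketch with made-up market data:

```python
import numpy as np
import tf_quant_finance as tff

# illustrative market data for three European call options
volatilities = np.array([0.2, 0.25, 0.3])
strikes = np.array([100.0, 100.0, 100.0])
expiries = np.array([0.5, 1.0, 2.0])        # in years
spots = np.array([95.0, 100.0, 105.0])

prices = tff.black_scholes.option_price(
    volatilities=volatilities,
    strikes=strikes,
    expiries=expiries,
    spots=spots,
)
print(prices.numpy())
```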
Lingvo
Lingvo is an open-source framework for developing neural networks in TensorFlow, particularly sequence models. A list of publications using Lingvo is available in its GitHub repository.
Nyuzi Processor
Nyuzi Processor is an experimental GPGPU processor hardware design focused on compute-intensive tasks. It is optimised for use cases such as deep learning and image processing. It includes a synthesisable hardware design written in SystemVerilog, an instruction set emulator, an LLVM-based C/C++ compiler, software libraries, and tests. It is also used to experiment with microarchitectural and instruction set design tradeoffs. More details on Nyuzi Processor can be found on GitHub.
Emu
Emu is a GPGPU library for Rust with a focus on portability, modularity, and performance. It can run almost anywhere, as it uses WebGPU with DirectX, Metal, and Vulkan as compile targets (with OpenGL and browser support planned). This lets Emu run on practically any platform, including desktop, mobile, and browser. Also, by moving heavy computations to the user’s device, applications can reduce system latency and improve privacy.
Emu makes WebGPU feel like CUDA. The abstraction is fully transparent: you can remove it and work directly with WebGPU constructs with zero overhead. It is also fully asynchronous.
Explore more open-source GPU computing projects on GitHub.