PyTorch recently released a new update, PyTorch 2.1. The update brings automatic dynamic shape support to torch.compile, distributed checkpointing for saving and loading distributed training jobs in parallel across multiple ranks, and torch.compile support for the NumPy API.
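For instance, a minimal sketch of the new NumPy support in torch.compile might look like the following; the function name and array shapes are illustrative:

```python
import numpy as np
import torch

# Plain NumPy code can be traced and executed through PyTorch's compiler
# stack; inputs and outputs remain NumPy arrays.
@torch.compile
def numpy_fn(x, y):
    return np.sum(x * y, axis=0)

a = np.random.randn(16, 8)
b = np.random.randn(16, 8)
print(numpy_fn(a, b).shape)  # (8,)
```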
In addition, beta updates have been released for the PyTorch domain libraries TorchAudio and TorchVision. Lastly, the community has added support for training and inference of Llama 2 models powered by AWS Inferentia2.
This will make running Llama 2 models on PyTorch quicker, cheaper and more efficient. The release was the work of 784 contributors across 6,682 commits.
New Features of PyTorch 2.1
- The new feature updates include AArch64 wheel builds, allowing devices with 64-bit Arm architectures to run PyTorch.
- Support for the latest CUDA 12.1 has been added for PyTorch binaries.
- PyTorch can now be compiled natively on Apple M1 instead of being cross-compiled from x86, which caused performance issues. Native compilation improves performance and makes PyTorch easier to use directly on M1 processors.
- The UCC distributed communication backend is now enabled in CI, so developers can efficiently test and debug their code with UCC support.
Improvements
- Python Frontend: torch.device can now be used as a context manager to change the default device. This is a simple but powerful feature that can make your code more concise and readable (see the sketch after this list).
- Optimisation: NAdamW, an improved variant of AdamW, is now supported; it stands out for its stability and efficiency, making it a strong choice for faster and more accurate model training.
- Sparse Frontend: Semi-structured sparsity is a new type of sparsity that can be more efficient than traditional sparsity patterns on NVIDIA Ampere and newer architectures.
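A quick sketch of the torch.device context manager mentioned above; the CPU fallback is only there so the snippet runs without a GPU:

```python
import torch

# Tensors created inside the block use the chosen default device without
# an explicit device= argument.
dev = "cuda" if torch.cuda.is_available() else "cpu"

with torch.device(dev):
    x = torch.randn(3, 3)    # created on `dev`
    print(x.device)

y = torch.randn(3, 3)        # outside the block, the default is unchanged
print(y.device)              # cpu (unless the global default was changed)
```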
PyTorch’s TorchAudio v2.1 Library
The new update has introduced key features like the AudioEffector API for audio waveform enhancement and Forced Alignment for precise transcript-audio synchronisation. The addition of TorchAudio-Squim models allows estimation of speech quality metrics, while a CUDA-based CTC decoder improves automatic speech recognition efficiency.
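As a rough, hedged illustration of the forced-alignment workflow, the snippet below feeds placeholder log-probabilities (standing in for the output of a real CTC acoustic model) and transcript token IDs into torchaudio.functional.forced_align; all shapes and IDs are made up:

```python
import torch
import torchaudio.functional as F

# Placeholder frame-wise log-probabilities and transcript token IDs; a real
# pipeline would obtain these from an ASR model and a tokenizer.
num_frames, vocab_size = 50, 30
log_probs = torch.randn(1, num_frames, vocab_size).log_softmax(dim=-1)
targets = torch.tensor([[5, 12, 7, 9, 3]])  # batch of one transcript

# Returns the per-frame label path and its scores.
alignments, scores = F.forced_align(log_probs, targets, blank=0)
print(alignments.shape, scores.shape)
```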
In the realm of AI music, new utilities enable music generation using AI techniques, and updated training recipes enhance model training for specific tasks. However, users need to adapt to changes like updated FFmpeg support (versions 6, 5, 4.4) and libsox integration, impacting audio file handling.
These updates expand PyTorch’s capabilities, making audio processing and AI music generation more efficient and precise. With enhanced alignment, speech quality assessment, and faster speech recognition, TorchAudio v2.1 is a valuable upgrade.
TorchRL Library
PyTorch has enhanced the RLHF components, making it easy for developers to build an RLHF training loop with limited RL knowledge. TensorDict enables easy interaction between datasets (say, Hugging Face datasets) and RL models. New algorithms have also been added, offering a wide range of solutions for offline RL training and making it more data efficient.
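A small sketch of how TensorDict groups batched tensors behind a single dict-like object so datasets and RL models can exchange them easily; the keys and shapes here are made up:

```python
import torch
from tensordict import TensorDict

# A batch of 4 illustrative samples whose fields can be indexed, reshaped,
# or moved to a device together.
batch = TensorDict(
    {
        "observation": torch.randn(4, 16),
        "reward": torch.randn(4, 1),
        "done": torch.zeros(4, 1, dtype=torch.bool),
    },
    batch_size=[4],
)

sub = batch[:2]                   # slicing applies to every field at once
print(sub["observation"].shape)   # torch.Size([2, 16])
```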
TorchRL can also now work directly with hardware, like robots, for seamless training and deployment. It has added essential algorithms and expanded its supported environments for faster data collection and value function execution.
TorchVision Library
The TorchVision library is now 10%-40% faster in this release, thanks to 2x-4x improvements made to the second version of Resize. “This is mostly achieved thanks to 2X-4X improvements made to v2.Resize(), which now supports native uint8 tensors for Bilinear and Bicubic mode. Output results are also now closer to PIL’s!,” reads the blog.
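A hedged sketch of that uint8 fast path; the image is random data and the target size arbitrary:

```python
import torch
from torchvision.transforms import v2

# v2.Resize can operate directly on a native uint8 tensor (bilinear by
# default); no float conversion is needed and the output stays uint8.
img = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)  # fake CHW image
out = v2.Resize((224, 224), antialias=True)(img)
print(out.shape, out.dtype)  # torch.Size([3, 224, 224]) torch.uint8
```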
Additionally, TorchVision now supports CutMix and MixUp augmentations. The previous beta transforms are now stabilised, offering improved performance for tasks like segmentation and detection.
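A sketch of the CutMix/MixUp usage on a batch; the batch contents and class count are placeholders:

```python
import torch
from torchvision.transforms import v2

# CutMix/MixUp operate on whole batches of images plus integer labels and
# return mixed images together with soft (probability) labels.
num_classes = 10
images = torch.randn(8, 3, 224, 224)          # fake image batch
labels = torch.randint(0, num_classes, (8,))  # integer class labels

cutmix_or_mixup = v2.RandomChoice(
    [v2.CutMix(num_classes=num_classes), v2.MixUp(num_classes=num_classes)]
)
images, labels = cutmix_or_mixup(images, labels)
print(images.shape, labels.shape)  # (8, 3, 224, 224) and (8, 10)
```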
Llama 2 Deployment with AWS Inferentia2 using TorchServe
For the first time, PyTorch has deployed the Llama 2 model for inference with Transformers Neuron using TorchServe. This is done through Amazon SageMaker on EC2 Inferentia2 instances, which offer 3x higher compute with 4x more accelerator memory, resulting in up to 4x higher throughput and up to 10x lower latency.
The optimisation techniques from the AWS Neuron SDK enhance performance while keeping costs low. The PyTorch post on the Llama deployment also shares benchmarking results.
The framework is integrated with Llama 2 through AWS Transformers Neuron, enabling seamless usage of Llama 2 models for optimised inference on Inf2 instances.