The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL is a central piece of software for multi-GPU deep learning training. It handles any kind of inter-GPU communication, be it over PCIe, NVLink, or networking. It uses advanced topology detection, optimized communication graphs…
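As a rough illustration of what those primitives look like in practice, here is a minimal single-process all-reduce sketch against the public NCCL API, with one communicator per visible GPU (error handling and data initialization omitted for brevity):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);

  std::vector<ncclComm_t> comms(nDev);
  std::vector<float*> sendbuf(nDev), recvbuf(nDev);
  std::vector<cudaStream_t> streams(nDev);
  const size_t count = 1 << 20;

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // One communicator per device; a null device list means GPUs 0..nDev-1.
  ncclCommInitAll(comms.data(), nDev, nullptr);

  // Sum-reduce across all GPUs; NCCL picks NVLink/PCIe/network paths itself.
  ncclGroupStart();
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```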
Historically, GPU device code is compiled alongside the application with offline tools such as nvcc. In this case, the GPU device code is managed internally by the CUDA runtime. You can then launch kernels with the <<<...>>> execution configuration syntax, and the CUDA runtime ensures that the invoked kernel is launched. However, in some cases, GPU device code needs to be dynamically compiled and loaded. This post shows a way to…
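A minimal sketch of that dynamic path, assuming NVRTC for runtime compilation and the driver API for loading (error checks omitted for brevity):

```cpp
#include <nvrtc.h>
#include <cuda.h>
#include <vector>

const char* kSrc = R"(
extern "C" __global__ void scale(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
})";

int main() {
  // 1. Compile CUDA C++ source to PTX at run time with NVRTC.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kSrc, "scale.cu", 0, nullptr, nullptr);
  const char* opts[] = {"--gpu-architecture=compute_70"};
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptxSize;
  nvrtcGetPTXSize(prog, &ptxSize);
  std::vector<char> ptx(ptxSize);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // 2. Load the PTX with the driver API and launch the kernel by name.
  cuInit(0);
  CUdevice dev;  CUcontext ctx;  CUmodule mod;  CUfunction fn;
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);
  cuModuleLoadData(&mod, ptx.data());
  cuModuleGetFunction(&fn, mod, "scale");

  int n = 1024;  float a = 2.0f;
  CUdeviceptr d_x;
  cuMemAlloc(&d_x, n * sizeof(float));
  void* args[] = {&d_x, &a, &n};
  cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
  cuCtxSynchronize();

  cuMemFree(d_x);
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}
```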
The latest release of the CUDA Toolkit, version 12.8, continues to push accelerated computing performance in data sciences, AI, scientific computing, and computer graphics and simulation, using the latest NVIDIA CPUs and GPUs. This post highlights some of the new features and enhancements included with this release: CUDA Toolkit 12.8 is the first version of the Toolkit to support…
Whether you’re just starting your GPU programming journey or you’re a CUDA ninja looking to share advanced techniques, join us in San Jose on 1/30/25.
For the past few months, the NVIDIA Collective Communications Library (NCCL) developers have been working hard on a set of new library features and bug fixes. In this post, we discuss the details of the NCCL 2.22 release and the pain points addressed. NVIDIA Magnum IO NCCL is a library designed to optimize inter-GPU and multi-node communication, crucial for efficient parallel computing…
CUDA Graphs are a way to define and batch GPU operations as a graph rather than a sequence of stream launches. A CUDA Graph groups a set of CUDA kernels and other CUDA operations together and executes them with a specified dependency tree. It speeds up the workflow by combining the driver activities associated with CUDA kernel launches and CUDA API calls. It also enforces the dependencies with…
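A minimal sketch of the idea using stream capture, with two illustrative kernels and the CUDA 12 cudaGraphInstantiate signature:

```cpp
#include <cuda_runtime.h>

__global__ void kernelA(float* x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
  float* d_data;
  cudaMalloc(&d_data, 256 * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Record the two launches into a graph instead of executing them.
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  kernelA<<<1, 256, 0, stream>>>(d_data);
  kernelB<<<1, 256, 0, stream>>>(d_data);
  cudaStreamEndCapture(stream, &graph);

  // Instantiate once; each replay then costs a single launch call.
  cudaGraphExec_t exec;
  cudaGraphInstantiate(&exec, graph, 0);
  for (int step = 0; step < 1000; ++step)
    cudaGraphLaunch(exec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(exec);
  cudaGraphDestroy(graph);
  cudaFree(d_data);
  return 0;
}
```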
NVSHMEM is a parallel programming interface that provides efficient and scalable communication for NVIDIA GPU clusters. Part of NVIDIA Magnum IO and based on OpenSHMEM, NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams.
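A rough sketch modeled on NVSHMEM's canonical shift example, assuming one GPU per PE (the device mapping and names are illustrative):

```cpp
#include <nvshmem.h>
#include <cuda_runtime.h>
#include <cstdio>

// Each PE writes its ID into the next PE's copy of the symmetric variable
// with a single fine-grained, GPU-initiated put.
__global__ void put_to_neighbor(int* dest, int mype, int npes) {
  nvshmem_int_p(dest, mype, (mype + 1) % npes);
}

int main() {
  nvshmem_init();
  int mype = nvshmem_my_pe();
  int npes = nvshmem_n_pes();

  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(mype % ndev);  // simplistic one-GPU-per-PE mapping

  // Symmetric allocation: the same remote-accessible buffer exists on every PE.
  int* dest = (int*)nvshmem_malloc(sizeof(int));

  put_to_neighbor<<<1, 1>>>(dest, mype, npes);
  cudaDeviceSynchronize();
  nvshmem_barrier_all();       // all puts are now globally visible

  int value = -1;
  cudaMemcpy(&value, dest, sizeof(int), cudaMemcpyDeviceToHost);
  printf("PE %d received %d\n", mype, value);

  nvshmem_free(dest);
  nvshmem_finalize();
  return 0;
}
```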
GPUs are specially designed to crunch through massive amounts of data at high speed. They have a large amount of compute resources, called streaming multiprocessors (SMs), and an array of facilities to keep them fed with data: high bandwidth to memory, sizable data caches, and the capability to switch to other teams of workers (warps) without any overhead if an active team has run out of data.
The release supports GB100 capabilities and new library enhancements to cuBLAS, cuFFT, cuSOLVER, cuSPARSE, as well as the release of Nsight Compute 2024.3.
With the R515 driver, NVIDIA released a set of Linux GPU kernel modules in May 2022 as open source with dual GPL and MIT licensing. The initial release targeted datacenter compute GPUs, with GeForce and Workstation GPUs in an alpha state. At the time, we announced that more robust and fully-featured GeForce and Workstation Linux support would follow in subsequent releases and the NVIDIA Open…
NVIDIA is excited to collaborate with Colfax, Together.ai, Meta, and Princeton University on their recent achievement to exploit the Hopper GPU architecture and Tensor Cores and accelerate key Fused Attention kernels using CUTLASS 3. FlashAttention-3 incorporates key techniques to achieve 1.5–2.0x faster performance than FlashAttention-2 with FP16, up to 740 TFLOPS. With FP8…
Post updated on February 3, 2025 with details about CUDA 12.8. CUDA Graphs can provide a significant performance increase, as the driver is able to optimize execution using the complete description of tasks and dependencies. Graphs provide incredible benefits for static workflows where the overhead of graph creation can be amortized over many successive launches. However…
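A hedged sketch of how that amortization can be preserved when kernel parameters change between iterations, assuming the CUDA 12 cudaGraphExecUpdate signature and a hypothetical step_kernel:

```cpp
#include <cuda_runtime.h>

__global__ void step_kernel(float* x, float scale) { x[threadIdx.x] *= scale; }

// Re-capture each iteration; update the executable graph in place when the
// topology still matches, re-instantiating only as a fallback.
void run_step(cudaStream_t stream, cudaGraphExec_t& exec,
              float* d_x, float scale) {
  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  step_kernel<<<1, 256, 0, stream>>>(d_x, scale);  // parameters may change
  cudaStreamEndCapture(stream, &graph);

  if (exec == nullptr) {
    cudaGraphInstantiate(&exec, graph, 0);         // first iteration
  } else {
    cudaGraphExecUpdateResultInfo info;
    if (cudaGraphExecUpdate(exec, graph, &info) != cudaSuccess) {
      cudaGraphExecDestroy(exec);                  // structure changed too much
      cudaGraphInstantiate(&exec, graph, 0);
    }
  }
  cudaGraphDestroy(graph);
  cudaGraphLaunch(exec, stream);
}
```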
NVIDIA GPUs are becoming increasingly powerful with each new generation. This increase generally comes in two forms. Each streaming multi-processor (SM), the workhorse of the GPU, can execute instructions faster and faster, and the memory system can deliver data to the SMs at an ever-increasing pace. At the same time, the number of SMs also typically increases with each generation…
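One common way to write kernels that scale automatically with growing SM counts is the grid-stride loop; a minimal sketch (sizing the grid at a few blocks per SM is a heuristic, not a rule):

```cpp
#include <cuda_runtime.h>

// The stride covers the whole array regardless of how many blocks run.
__global__ void saxpy(int n, float a, const float* x, float* y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    y[i] = a * x[i] + y[i];
}

void launch_saxpy(int n, float a, const float* x, float* y) {
  int dev, numSMs;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
  // A few blocks per SM keeps every SM busy as SM counts grow generation to generation.
  saxpy<<<4 * numSMs, 256>>>(n, a, x, y);
}
```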
The latest release of CUDA Toolkit, version 12.4, continues to push accelerated computing performance using the latest NVIDIA GPUs. This post explains the new features and enhancements included in this release: CUDA and the CUDA Toolkit software provide the foundation for all NVIDIA GPU-accelerated computing applications in data science and analytics, machine learning…
Many CUDA applications running on multi-GPU platforms usually use a single GPU for their compute needs. In such scenarios, applications pay a performance penalty because CUDA must enumerate and initialize all the GPUs on the system. If a CUDA application does not require other GPUs to be visible and accessible, you can launch such applications by isolating the unwanted GPUs from the CUDA…
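A minimal sketch of that isolation, assuming a POSIX environment; the variable must be set before the process makes its first CUDA call, and is equivalent to `CUDA_VISIBLE_DEVICES=0 ./app` on the command line:

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstdio>

int main() {
  // Must happen before any CUDA API call in this process, because the
  // runtime reads the variable when it first initializes.
  setenv("CUDA_VISIBLE_DEVICES", "0", 1);

  int count = 0;
  cudaGetDeviceCount(&count);   // the runtime now enumerates only GPU 0
  printf("visible GPUs: %d\n", count);
  return 0;
}
```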
The latest release of CUDA Toolkit continues to push the envelope of accelerated computing performance using the latest NVIDIA GPUs. New features of this release, version 12.3, include: CUDA and the CUDA Toolkit continue to provide the foundation for all accelerated computing applications in data science, machine learning and deep learning, generative AI with LLMs for both training and…
Today, NVIDIA announces the public release of TensorRT-LLM to accelerate and optimize inference performance for the latest LLMs on NVIDIA GPUs. This open-source library is now available for free on the /NVIDIA/TensorRT-LLM GitHub repo and as part of the NVIDIA NeMo framework. Large language models (LLMs) have revolutionized the field of artificial intelligence and created entirely new ways of…
Large language models (LLMs) offer incredible new capabilities, expanding the frontier of what is possible with AI. However, their large size and unique execution characteristics can make them difficult to use in cost-effective ways. NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now a part of Databricks)…
The latest release of CUDA Toolkit 12.2 introduces a range of essential new features, modifications to the programming model, and enhanced support for hardware capabilities accelerating CUDA applications. Now generally available from NVIDIA, CUDA Toolkit 12.2 includes many new capabilities, both major and minor. The following post offers an overview of many of the key…
Watch on-demand as experts deep dive into CUDA 12.2, including support for confidential computing.
NVIDIA announces the cuNumeric and Legate beta release. The cuNumeric library provides an accelerated NumPy alternative, while Legate provides a parallel computing runtime abstraction layer.
Available now for download, the CUDA Toolkit 12.1 release provides support for NVIDIA Hopper and NVIDIA Ada Lovelace architecture.
CUDA Toolkit 12.0 introduces a new nvJitLink library for Just-in-Time Link Time Optimization (JIT LTO) support. In the early days of CUDA, to get maximum performance, developers had to build and compile CUDA kernels as a single source file in whole program compilation mode. This limited SDKs and applications with large swaths of code, spanning multiple files that required separate compilation from porting…
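A hedged sketch of the JIT LTO flow with nvJitLink, assuming LTO-IR fragments produced offline with `nvcc -dlto -dc` (the file names are illustrative placeholders):

```cpp
#include <nvJitLink.h>
#include <cuda.h>
#include <vector>

CUmodule link_and_load() {
  const char* opts[] = {"-lto", "-arch=sm_90"};
  nvJitLinkHandle h;
  nvJitLinkCreate(&h, 2, opts);

  // Separately compiled LTO-IR fragments (illustrative names).
  nvJitLinkAddFile(h, NVJITLINK_INPUT_LTOIR, "kernel_a.ltoir");
  nvJitLinkAddFile(h, NVJITLINK_INPUT_LTOIR, "kernel_b.ltoir");

  // Cross-file link-time optimization happens here, at run time.
  nvJitLinkComplete(h);

  size_t cubinSize;
  nvJitLinkGetLinkedCubinSize(h, &cubinSize);
  std::vector<char> cubin(cubinSize);
  nvJitLinkGetLinkedCubin(h, cubin.data());
  nvJitLinkDestroy(&h);

  CUmodule mod;
  cuModuleLoadData(&mod, cubin.data());
  return mod;
}
```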
Learn about the newest CUDA features such as release compatibility, dynamic parallelism, lazy module loading, and support for the new NVIDIA Hopper and NVIDIA Ada Lovelace GPU architectures.
CUDA Graphs significantly reduce the overhead of launching a large batch of user operations by defining them as a task graph, which may be launched in a single operation. Knowing the workflow upfront enables the CUDA driver to apply various optimizations, which cannot be performed when launching through a stream model. However, this performance comes at the cost of flexibility.
Most CUDA developers are familiar with the cuModuleLoad API and its counterparts for loading a module containing device code into a CUDA context. In most cases, you want to load identical device code on all devices. This requires loading device code into each CUDA context explicitly. Moreover, libraries and frameworks that do not control context creation and destruction must keep track of them to explicitly…
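A hedged sketch of the context-independent alternative introduced with CUDA 12, assuming a precompiled scale.cubin that exports a kernel named scale (both names are placeholders):

```cpp
#include <cuda.h>

void launch_once_loaded(CUdeviceptr d_x, float a, int n) {
  CUlibrary lib;
  CUkernel kern;

  // Not tied to any CUDA context; the driver materializes per-context
  // copies lazily on first use.
  cuLibraryLoadFromFile(&lib, "scale.cubin", nullptr, nullptr, 0,
                        nullptr, nullptr, 0);
  cuLibraryGetKernel(&kern, lib, "scale");

  void* args[] = {&d_x, &a, &n};
  // A CUkernel can be passed where a CUfunction is expected when launching.
  cuLaunchKernel((CUfunction)kern, (n + 255) / 256, 1, 1, 256, 1, 1,
                 0, nullptr, args, nullptr);
  cuLibraryUnload(lib);
}
```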
NVIDIA announces the newest CUDA Toolkit software release, 12.0. This release is the first major release in many years and it focuses on new programming models and CUDA application acceleration through new hardware capabilities. For more information, watch the YouTube Premiere webinar, CUDA 12.0: New Features and Beyond. You can now target architecture-specific features and instructions…
CUDA Toolkit 12.0 supports NVIDIA Hopper architecture and many new features to help developers maximize performance on NVIDIA GPU-based products.
NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge on Jetson platforms. Previously, a standalone version of NVIDIA JetPack supported a single release of CUDA, and you could not upgrade CUDA on a given NVIDIA JetPack version. NVIDIA JetPack is released on a rolling cadence with a single version of CUDA…
NVIDIA announces the newest CUDA Toolkit software release, 11.8. This release is focused on enhancing the programming model and CUDA application speedup through new hardware capabilities. New architecture-specific features in NVIDIA Hopper and Ada Lovelace are initially being exposed through libraries and framework enhancements. The full programming model enhancements for the NVIDIA Hopper…
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, as GPUs also have high memory bandwidth, but sometimes they need the programmer’s help to saturate that bandwidth. In this post, we examine one method to accomplish that and apply it to an example taken from financial computing.
NVIDIA is now publishing Linux GPU kernel modules as open source with dual GPL/MIT license, starting with the R515 driver release. You can find the source code for these kernel modules on the NVIDIA/open-gpu-kernel-modules GitHub page. This release is a significant step toward improving the experience of using NVIDIA GPUs in Linux, for tighter integration with the OS, and for developers to…
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also have high memory bandwidth, but sometimes they need your help to saturate that bandwidth. In this post, we examine one specific method to accomplish that: prefetching. We explain the circumstances under which prefetching can be expected…
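A minimal illustration of the technique, with the prefetch distance fixed at one grid stride (a simplification of what the post develops): the next iteration's element is requested before the current one is used, so the load's latency overlaps with arithmetic.

```cpp
__global__ void scaled_sum(const float* __restrict__ in, float* out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  if (tid >= n) return;

  float cur = in[tid];                  // prime the pipeline
  float acc = 0.0f;
  for (int i = tid; i < n; i += stride) {
    float nxt = (i + stride < n) ? in[i + stride] : 0.0f;  // prefetch ahead
    acc += cur * cur;                   // compute overlaps the pending load
    cur = nxt;
  }
  out[tid] = acc;
}
```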
Typically, real-time physics simulation code is written in low-level CUDA C++ for maximum performance. In this post, we introduce NVIDIA Warp, a new Python framework that makes it easy to write differentiable graphics and simulation GPU code in Python. Warp provides the building blocks needed to write high-performance simulation code, but with the productivity of working in an interpreted language…
NVIDIA announces the newest release of the CUDA development environment, CUDA 11.6. This release is focused on enhancing the programming model and performance of your CUDA applications. CUDA continues to push the boundaries of GPU acceleration and lay the foundation for new applications in HPC, visualization, AI, ML and DL, and data science. CUDA 11.6 has several important features.
NVIDIA announces the newest release of the CUDA development environment, CUDA 11.5. CUDA 11.5 is focused on enhancing the programming model and performance of your CUDA applications. CUDA continues to push the boundaries of GPU acceleration and lay the foundation for new applications in HPC, visualization, AI, ML and DL, and data sciences. CUDA 11.5 has several important features.
NVIDIA announces our newest release of the CUDA development environment consisting of GPU-accelerated libraries, debugging and optimization tools, an updated C/C++ compiler, and a runtime library to build and deploy your application on major architectures including NVIDIA Ampere, x86, Arm server processors, and POWER. The latest release, CUDA 11.4, and its features are focused on enhancing the…