Graph neural networks (GNNs) have revolutionized machine learning for graph-structured data. Unlike traditional neural networks, GNNs excel at capturing the intricate relationships in graphs, powering applications from social networks to chemistry. They are particularly effective in tasks such as node classification, where they predict labels for graph nodes, and link prediction, where they determine the…
Large language model (LLM) applications enhance productivity across industries through natural language. However, their effectiveness is often limited by the scope of their training data, resulting in poor performance on real-time events and new knowledge the LLM isn't trained on. Retrieval-augmented generation (RAG) addresses these problems.
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, as GPUs also have high memory bandwidth, but sometimes they need the programmer's help to saturate that bandwidth. In this post, we examine one method to accomplish that and apply it to an example taken from financial computing.
In part 1 of this series, we introduced two new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered…
Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren't stream ordered. In this post, we introduce two new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new…
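The excerpt above doesn't include the function signatures, but a minimal sketch of how the stream-ordered allocator is typically used might look like the following; error checking is omitted and the kernel is only a placeholder:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Allocation is ordered after prior work in `stream`, not device-wide.
    float *d_data = nullptr;
    cudaMallocAsync((void **)&d_data, n * sizeof(float), stream);

    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    // Deallocation is also stream-ordered: it takes effect only after the
    // kernel above completes, with no device-wide synchronization.
    cudaFreeAsync(d_data, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    printf("done\n");
    return 0;
}
```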
Acceleration structures spatially organize geometry to accelerate ray tracing traversal performance. When you create an acceleration structure, a conservative memory size is allocated. The actual size needed is often smaller, so after the build the structure can be copied into a tighter allocation. This process is called compacting the acceleration structure, and it is important for reducing the memory overhead of acceleration structures. Another key ingredient to reducing memory is suballocating…
In ray tracing, more geometry can reside in GPU memory than with the rasterization approach, because rays may hit geometry outside the view frustum. You can let the GPU compact acceleration structures to save memory. For some games, compaction reduces the memory footprint of a bottom-level acceleration structure (BLAS) by at least 50%. BLASes usually take more GPU memory than top…
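The posts' own code isn't reproduced in these excerpts, but a rough C++ sketch of the DXR compaction flow they describe could look like the following. All resource creation, barriers, and the size readback are assumed to happen elsewhere (in practice the copy is recorded after the compacted size has been read back), and every name below (cmdList, postbuildInfoBuffer, and so on) is a placeholder:

```cpp
#include <d3d12.h>

void BuildAndCompactBlas(ID3D12GraphicsCommandList4 *cmdList,
                         D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC &buildDesc,
                         ID3D12Resource *postbuildInfoBuffer,   // UAV buffer receiving the compacted size
                         D3D12_GPU_VIRTUAL_ADDRESS originalBlas,
                         D3D12_GPU_VIRTUAL_ADDRESS compactedBlas)
{
    // 1. Request compaction support at build time.
    buildDesc.Inputs.Flags |=
        D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_COMPACTION;

    // 2. Build the BLAS and ask the driver to report its compacted size.
    D3D12_RAYTRACING_ACCELERATION_STRUCTURE_POSTBUILD_INFO_DESC postbuild = {};
    postbuild.InfoType =
        D3D12_RAYTRACING_ACCELERATION_STRUCTURE_POSTBUILD_INFO_COMPACTED_SIZE;
    postbuild.DestBuffer = postbuildInfoBuffer->GetGPUVirtualAddress();
    cmdList->BuildRaytracingAccelerationStructure(&buildDesc, 1, &postbuild);

    // 3. After reading back the compacted size and allocating a smaller buffer
    //    (not shown), copy the BLAS into it in compacting mode.
    cmdList->CopyRaytracingAccelerationStructure(
        compactedBlas, originalBlas,
        D3D12_RAYTRACING_ACCELERATION_STRUCTURE_COPY_MODE_COMPACT);
}
```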
In Microsoft Direct3D, anything that uses memory is considered a resource: textures, vertex buffers, index buffers, render targets, constant buffers, structured buffers, and so on. It's natural to think that each individual object, such as a texture, is always one resource. In this post, I discuss DXR's Bottom Level Acceleration Structures (BLASes) and best practices with regard to managing them.
Apache Spark provides capabilities to program entire clusters with implicit data parallelism. With Spark 3.0 and the open source RAPIDS Accelerator for Spark, these capabilities are extended to GPUs. However, prior to this work, all CUDA operations happened in the default stream, causing implicit synchronization and not taking advantage of concurrency on the GPU. In this post, we look at how to use…
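This is not the Spark/RAPIDS code from the post; it is only a minimal CUDA illustration of the problem the excerpt describes: kernels issued to the legacy default stream serialize, while kernels issued to independently created non-blocking streams are allowed to overlap. The kernel and sizes are placeholders:

```cuda
#include <cuda_runtime.h>

__global__ void busyWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 1.0001f + 0.0001f;
}

int main() {
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    // Issued to the legacy default stream: these two kernels run one after the other.
    busyWork<<<(n + 255) / 256, 256>>>(d_a, n);
    busyWork<<<(n + 255) / 256, 256>>>(d_b, n);

    // Issued to independent non-blocking streams: the kernels may overlap.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    busyWork<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    busyWork<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```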
When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrames rather than modifying them in place. The overhead of cudaMalloc and synchronization of cudaFree was holding RAPIDS back. My first task for RAPIDS was to help with this problem, so I created a rough…
As AI and HPC datasets continue to increase in size, the time spent loading data for a given application begins to place a strain on the total application's performance. When considering end-to-end application performance, fast GPUs are increasingly starved by slow I/O. I/O, the process of loading data from storage to GPUs for processing, has historically been controlled by the CPU.
Natural language systems have become the go-between for humans and AI-assisted digital services. Digital assistants, chatbots, and automated HR systems all rely on understanding language, working in the space of question answering. So what are question answering (QA) systems and why do they matter? In general, QA systems take some sort of context in the form of natural language and retrieve…
Many of today's applications process large volumes of data. While GPU architectures have very fast HBM or GDDR memory, they have limited capacity. Making the most of GPU performance requires the data to be as close to the GPU as possible. This is especially important for applications that iterate over the same data multiple times or have a high flops/byte ratio. Many real-world codes have to…
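The excerpt doesn't show the post's own technique, but one common way to keep a working set close to the GPU, assuming Unified Memory is used for the allocation, is to prefetch it to the device before the loop that iterates over it. The kernel and sizes below are placeholders:

```cuda
#include <cuda_runtime.h>

__global__ void iterate(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 26;   // a working set large enough to care about locality
    int device = 0;
    cudaGetDevice(&device);

    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    // Move the data into GPU memory before the compute loop so the kernels
    // that iterate over it repeatedly are not stalled by on-demand migration.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

    for (int pass = 0; pass < 10; ++pass)
        iterate<<<(n + 255) / 256, 256>>>(data, n);

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```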
For more recent info on NVLink, check out the post, "How NVLink Will Enable Faster, Easier Multi-GPU Computing". NVIDIA GPU accelerators have emerged in High-Performance Computing as an energy-efficient way to provide significant compute capability. The Green500 supercomputer list makes this clear: the top 10 supercomputers on the list feature NVIDIA GPUs. Today at the 2014 GPU Technology…
With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that was exactly how the programmer had to view things. Data that is shared between the CPU and GPU must be…
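As a brief illustration of the idea (not the post's own example), the sketch below allocates managed memory once and touches it from both the CPU and the GPU through the same pointer:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int *data;

    // One allocation, one pointer, visible to both CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;   // initialize on the CPU

    addOne<<<(n + 255) / 256, 256>>>(data, n); // update on the GPU
    cudaDeviceSynchronize();                   // required before the CPU touches it again

    printf("data[0] = %d, data[%d] = %d\n", data[0], n - 1, data[n - 1]);
    cudaFree(data);
    return 0;
}
```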
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA…
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of…
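A minimal kernel of the kind these posts use to demonstrate the cost of striding might look like the sketch below; the array size and launch configuration are arbitrary, and the timing code that would normally accompany it is omitted:

```cuda
#include <cuda_runtime.h>

// Each thread reads and writes element i*stride. With stride == 1, accesses within a
// warp are coalesced; as the stride grows, each warp touches more memory segments
// and effective bandwidth drops.
__global__ void strideAccess(float *a, int stride, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) a[i] = a[i] + 1.0f;
}

int main() {
    const int n = 32 * 1024 * 1024;
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));

    // Launch enough threads to cover the array for each stride; timing each launch
    // with CUDA events (omitted) shows bandwidth falling as the stride increases.
    for (int stride = 1; stride <= 32; stride *= 2) {
        int elems = n / stride;
        strideAccess<<<(elems + 255) / 256, 256>>>(d_a, stride, n);
    }

    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}
```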
In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA C/C++ series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…
In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA Fortran series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…
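For brevity, here is a CUDA C/C++ sketch (even though the second excerpt is from the Fortran counterpart) of the different memory spaces these paragraphs mention, each annotated with its scope and lifetime; the kernel itself is only illustrative:

```cuda
#include <cuda_runtime.h>

__constant__ float scaleFactor;   // constant memory: read-only in kernels, cached, application lifetime
__device__   float globalAccum;   // global memory: visible to all threads, application lifetime

__global__ void demo(const float *in, float *out, int n) {
    __shared__ float tile[256];   // shared memory: per-block scope, lifetime of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = 0.0f;               // register/local memory: per-thread scope and lifetime

    if (i < n) {
        tile[threadIdx.x] = in[i];   // stage global data in shared memory
        __syncthreads();
        x = tile[threadIdx.x] * scaleFactor;
        out[i] = x;
    }
    if (i == 0) globalAccum = x;     // write back through a global-scope variable
}

int main() {
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scaleFactor, &h_scale, sizeof(float));

    demo<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```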
In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let's learn…
In my previous CUDA Fortran post I discussed how to transfer data efficiently between the host and device. In this post, I discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let's…
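A condensed sketch of the overlap pattern both posts describe, assuming pinned host memory and an arbitrary placeholder kernel, might look like this:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int offset, int chunk) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + chunk) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 4 * (1 << 20), nStreams = 4, chunk = n / nStreams;
    float *h_a, *d_a;

    cudaMallocHost(&h_a, n * sizeof(float));   // pinned host memory: needed for true async copies
    cudaMalloc(&d_a, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Pipeline: the copy for chunk s+1 can overlap the kernel for chunk s,
    // and device-to-host copies overlap later host-to-device copies.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_a + offset, h_a + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_a, offset, chunk);
        cudaMemcpyAsync(h_a + offset, d_a + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_a);
    cudaFree(d_a);
    return 0;
}
```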
In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
In the previous three posts of this CUDA Fortran series we laid the groundwork for the major thrust of the series: how to optimize CUDA Fortran code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
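As a rough sketch of the kind of measurement these posts discuss, the following (in CUDA C/C++) times a host-to-device copy from pageable and from pinned host memory using CUDA events; the reported bandwidth will depend on the system:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time a host-to-device copy with CUDA events and report effective bandwidth.
static void measure(const char *label, float *h_src, float *d_dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%s: %.2f GB/s\n", label, bytes / ms / 1.0e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main() {
    const size_t bytes = (1 << 24) * sizeof(float);   // 64 MB
    float *d_a, *h_pageable, *h_pinned;

    cudaMalloc(&d_a, bytes);
    h_pageable = (float *)malloc(bytes);              // ordinary pageable host memory
    cudaMallocHost(&h_pinned, bytes);                 // page-locked (pinned) host memory

    measure("pageable", h_pageable, d_a, bytes);
    measure("pinned  ", h_pinned,   d_a, bytes);

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_a);
    return 0;
}
```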