Graph neural networks (GNNs) have revolutionized machine learning for graph-structured data. Unlike traditional neural networks, GNNs excel at capturing the intricate relationships in graphs, powering applications from social networks to chemistry. They are particularly effective in tasks such as node classification, where they predict labels for graph nodes, and link prediction, where they determine the…
Large language model (LLM) applications enhance productivity across industries through natural language. However, their effectiveness is often limited by the scope of their training data, resulting in poor performance on real-time events and new knowledge the LLM isn't trained on. Retrieval-augmented generation (RAG) addresses these problems.
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, as GPUs also have high memory bandwidth, but sometimes they need the programmer's help to saturate that bandwidth. In this post, we examine one method to accomplish that and apply it to an example taken from financial computing.
In part 1 of this series, we introduced two new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered…
Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren't stream ordered. In this post, we introduce two new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new…
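The excerpt above doesn't include the function signatures, but a minimal sketch of how the stream-ordered allocator is typically used might look like the following; error checking is omitted and the kernel is only a placeholder:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Allocation is ordered after prior work in `stream`, not device-wide.
    float *d_data = nullptr;
    cudaMallocAsync((void **)&d_data, n * sizeof(float), stream);

    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    // Deallocation is also stream-ordered: it takes effect only after the
    // kernel above completes, with no device-wide synchronization.
    cudaFreeAsync(d_data, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    printf("done\n");
    return 0;
}
```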
Acceleration structures spatially organize geometry to accelerate ray tracing traversal performance. When you create an acceleration structure, a conservative memory size is allocated. The actual size needed is often smaller, so after the build the structure can be copied into a tighter allocation. This process is called compacting the acceleration structure, and it is important for reducing the memory overhead of acceleration structures. Another key ingredient to reducing memory is suballocating…
In ray tracing, more geometry can reside in GPU memory than with the rasterization approach, because rays may hit geometry outside the view frustum. You can let the GPU compact acceleration structures to save memory. For some games, compaction reduces the memory footprint of a bottom-level acceleration structure (BLAS) by at least 50%. BLASes usually take more GPU memory than top…
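The posts' own code isn't reproduced in these excerpts, but a rough C++ sketch of the DXR compaction flow they describe could look like the following. All resource creation, barriers, and the size readback are assumed to happen elsewhere (in practice the copy is recorded after the compacted size has been read back), and every name below (cmdList, postbuildInfoBuffer, and so on) is a placeholder:

```cpp
#include <d3d12.h>

void BuildAndCompactBlas(ID3D12GraphicsCommandList4 *cmdList,
                         D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC &buildDesc,
                         ID3D12Resource *postbuildInfoBuffer,   // UAV buffer receiving the compacted size
                         D3D12_GPU_VIRTUAL_ADDRESS originalBlas,
                         D3D12_GPU_VIRTUAL_ADDRESS compactedBlas)
{
    // 1. Request compaction support at build time.
    buildDesc.Inputs.Flags |=
        D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_COMPACTION;

    // 2. Build the BLAS and ask the driver to report its compacted size.
    D3D12_RAYTRACING_ACCELERATION_STRUCTURE_POSTBUILD_INFO_DESC postbuild = {};
    postbuild.InfoType =
        D3D12_RAYTRACING_ACCELERATION_STRUCTURE_POSTBUILD_INFO_COMPACTED_SIZE;
    postbuild.DestBuffer = postbuildInfoBuffer->GetGPUVirtualAddress();
    cmdList->BuildRaytracingAccelerationStructure(&buildDesc, 1, &postbuild);

    // 3. After reading back the compacted size and allocating a smaller buffer
    //    (not shown), copy the BLAS into it in compacting mode.
    cmdList->CopyRaytracingAccelerationStructure(
        compactedBlas, originalBlas,
        D3D12_RAYTRACING_ACCELERATION_STRUCTURE_COPY_MODE_COMPACT);
}
```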
In Microsoft Direct3D, anything that uses memory is considered a resource: textures, vertex buffers, index buffers, render targets, constant buffers, structured buffers, and so on. It's natural to think that each individual object, such as a texture, is always one resource. In this post, I discuss DXR's Bottom Level Acceleration Structures (BLASes) and best practices with regard to managing them.
Apache Spark provides capabilities to program entire clusters with implicit data parallelism. With Spark 3.0 and the open source RAPIDS Accelerator for Spark, these capabilities are extended to GPUs. However, prior to this work, all CUDA operations happened in the default stream, causing implicit synchronization and not taking advantage of concurrency on the GPU. In this post, we look at how to use…
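This is not the Spark/RAPIDS code from the post; it is only a minimal CUDA illustration of the problem the excerpt describes: kernels issued to the legacy default stream serialize, while kernels issued to independently created non-blocking streams are allowed to overlap. The kernel and sizes are placeholders:

```cuda
#include <cuda_runtime.h>

__global__ void busyWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 1.0001f + 0.0001f;
}

int main() {
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    // Issued to the legacy default stream: these two kernels run one after the other.
    busyWork<<<(n + 255) / 256, 256>>>(d_a, n);
    busyWork<<<(n + 255) / 256, 256>>>(d_b, n);

    // Issued to independent non-blocking streams: the kernels may overlap.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    busyWork<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    busyWork<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```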
When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrames rather than modifying them in place. The overhead of cudaMalloc and synchronization of cudaFree was holding RAPIDS back. My first task for RAPIDS was to help with this problem, so I created a rough…
As AI and HPC datasets continue to increase in size, the time spent loading data for a given application begins to place a strain on the total application's performance. When considering end-to-end application performance, fast GPUs are increasingly starved by slow I/O. I/O, the process of loading data from storage to GPUs for processing, has historically been controlled by the CPU.
Natural language systems have become the go-between for humans and AI-assisted digital services. Digital assistants, chatbots, and automated HR systems all rely on understanding language, working in the space of question answering. So what are question answering (QA) systems and why do they matter? In general, QA systems take some sort of context in the form of natural language and retrieve…
Many of today's applications process large volumes of data. While GPU architectures have very fast HBM or GDDR memory, they have limited capacity. Making the most of GPU performance requires the data to be as close to the GPU as possible. This is especially important for applications that iterate over the same data multiple times or have a high flops/byte ratio. Many real-world codes have to…
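The excerpt doesn't show the post's own technique, but one common way to keep a working set close to the GPU, assuming Unified Memory is used for the allocation, is to prefetch it to the device before the loop that iterates over it. The kernel and sizes below are placeholders:

```cuda
#include <cuda_runtime.h>

__global__ void iterate(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 26;   // a working set large enough to care about locality
    int device = 0;
    cudaGetDevice(&device);

    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    // Move the data into GPU memory before the compute loop so the kernels
    // that iterate over it repeatedly are not stalled by on-demand migration.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);

    for (int pass = 0; pass < 10; ++pass)
        iterate<<<(n + 255) / 256, 256>>>(data, n);

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```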
For more recent info on NVLink, check out the post, "How NVLink Will Enable Faster, Easier Multi-GPU Computing". NVIDIA GPU accelerators have emerged in High-Performance Computing as an energy-efficient way to provide significant compute capability. The Green500 supercomputer list makes this clear: the top 10 supercomputers on the list feature NVIDIA GPUs. Today at the 2014 GPU Technology…
With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that was exactly how the programmer had to view things. Data that is shared between the CPU and GPU must be…
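As a brief illustration of the idea (not the post's own example), the sketch below allocates managed memory once and touches it from both the CPU and the GPU through the same pointer:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int *data;

    // One allocation, one pointer, visible to both CPU and GPU.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;   // initialize on the CPU

    addOne<<<(n + 255) / 256, 256>>>(data, n); // update on the GPU
    cudaDeviceSynchronize();                   // required before the CPU touches it again

    printf("data[0] = %d, data[%d] = %d\n", data[0], n - 1, data[n - 1]);
    cudaFree(data);
    return 0;
}
```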
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA…
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of…
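A minimal kernel of the kind these posts use to demonstrate the cost of striding might look like the sketch below; the array size and launch configuration are arbitrary, and the timing code that would normally accompany it is omitted:

```cuda
#include <cuda_runtime.h>

// Each thread reads and writes element i*stride. With stride == 1, accesses within a
// warp are coalesced; as the stride grows, each warp touches more memory segments
// and effective bandwidth drops.
__global__ void strideAccess(float *a, int stride, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) a[i] = a[i] + 1.0f;
}

int main() {
    const int n = 32 * 1024 * 1024;
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));

    // Launch enough threads to cover the array for each stride; timing each launch
    // with CUDA events (omitted) shows bandwidth falling as the stride increases.
    for (int stride = 1; stride <= 32; stride *= 2) {
        int elems = n / stride;
        strideAccess<<<(elems + 255) / 256, 256>>>(d_a, stride, n);
    }

    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}
```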
In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA C/C++ series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…
In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA Fortran series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…
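For brevity, here is a CUDA C/C++ sketch (even though the second excerpt is from the Fortran counterpart) of the different memory spaces these paragraphs mention, each annotated with its scope and lifetime; the kernel itself is only illustrative:

```cuda
#include <cuda_runtime.h>

__constant__ float scaleFactor;   // constant memory: read-only in kernels, cached, application lifetime
__device__   float globalAccum;   // global memory: visible to all threads, application lifetime

__global__ void demo(const float *in, float *out, int n) {
    __shared__ float tile[256];   // shared memory: per-block scope, lifetime of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = 0.0f;               // register/local memory: per-thread scope and lifetime

    if (i < n) {
        tile[threadIdx.x] = in[i];   // stage global data in shared memory
        __syncthreads();
        x = tile[threadIdx.x] * scaleFactor;
        out[i] = x;
    }
    if (i == 0) globalAccum = x;     // write back through a global-scope variable
}

int main() {
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    float h_scale = 2.0f;
    cudaMemcpyToSymbol(scaleFactor, &h_scale, sizeof(float));

    demo<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```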
In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let's learn…
In my previous CUDA Fortran post I discussed how to transfer data efficiently between the host and device. In this post, I discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let's…
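A condensed sketch of the overlap pattern both posts describe, assuming pinned host memory and an arbitrary placeholder kernel, might look like this:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int offset, int chunk) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + chunk) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 4 * (1 << 20), nStreams = 4, chunk = n / nStreams;
    float *h_a, *d_a;

    cudaMallocHost(&h_a, n * sizeof(float));   // pinned host memory: needed for true async copies
    cudaMalloc(&d_a, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Pipeline: the copy for chunk s+1 can overlap the kernel for chunk s,
    // and device-to-host copies overlap later host-to-device copies.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_a + offset, h_a + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_a, offset, chunk);
        cudaMemcpyAsync(h_a + offset, d_a + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_a);
    cudaFree(d_a);
    return 0;
}
```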
In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
In the previous three posts of this CUDA Fortran series we laid the groundwork for the major thrust of the series: how to optimize CUDA Fortran code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
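As a rough sketch of the kind of measurement these posts discuss, the following (in CUDA C/C++) times a host-to-device copy from pageable and from pinned host memory using CUDA events; the reported bandwidth will depend on the system:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time a host-to-device copy with CUDA events and report effective bandwidth.
static void measure(const char *label, float *h_src, float *d_dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%s: %.2f GB/s\n", label, bytes / ms / 1.0e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main() {
    const size_t bytes = (1 << 24) * sizeof(float);   // 64 MB
    float *d_a, *h_pageable, *h_pinned;

    cudaMalloc(&d_a, bytes);
    h_pageable = (float *)malloc(bytes);              // ordinary pageable host memory
    cudaMallocHost(&h_pinned, bytes);                 // page-locked (pinned) host memory

    measure("pageable", h_pageable, d_a, bytes);
    measure("pinned  ", h_pinned,   d_a, bytes);

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_a);
    return 0;
}
```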