GPUs are specially designed to crunch through massive amounts of data at high speed. They have a large number of compute resources, called streaming multiprocessors (SMs), and an array of facilities to keep them fed with data: high bandwidth to memory, sizable data caches, and the capability to switch to other teams of workers (warps) without any overhead if an active team has run out of data.
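As a minimal illustration of that latency hiding (a sketch, not code from the post), the kernel below is launched with far more warps than the SMs can keep resident, so the scheduler always has another warp ready while loads are in flight:

```cuda
#include <cstdio>

// Memory-bound kernel: each thread strides over the array, so the launch can
// supply many more warps than the SMs can run at once. Whenever one warp
// stalls on a load, the hardware switches to another at no cost.
__global__ void scale(float* __restrict__ out, const float* __restrict__ in,
                      float factor, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        out[i] = factor * in[i];
    }
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Oversubscribe the SMs with 1024 blocks of 256 threads.
    scale<<<1024, 256>>>(out, in, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```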
The new hardware developments in NVIDIA Grace Hopper Superchip systems enable some dramatic changes to the way developers approach GPU programming. Most notably, the bidirectional, high-bandwidth, and cache-coherent connection between CPU and GPU memory means that the user can develop their application for both processors while using a single, unified address space.
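A minimal sketch of what that unified address space allows, assuming a Grace Hopper system on which GPU kernels can dereference pointers returned by ordinary malloc (the kernel and sizes here are illustrative, not from the post):

```cuda
#include <cstdio>
#include <cstdlib>

// Each GPU thread increments one element of a buffer allocated with plain
// malloc on the CPU: no cudaMalloc or cudaMemcpy staging is involved.
__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;

    // Ordinary CPU allocation; on Grace Hopper the same pointer is valid on
    // both processors thanks to the cache-coherent CPU-GPU connection.
    int* data = static_cast<int*>(malloc(n * sizeof(int)));
    for (int i = 0; i < n; ++i) data[i] = i;

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("data[42] = %d\n", data[42]);  // expected: 43
    free(data);
    return 0;
}
```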
Debugging is difficult. Debugging across multiple languages is especially challenging, and debugging across devices often requires a team with varying skill sets and expertise to reveal the underlying problem. Yet projects often require using multiple languages, to ensure high performance where necessary, a user-friendly experience, and compatibility where possible. Unfortunately…
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also have high memory bandwidth, but sometimes they need your help to saturate that bandwidth. In this post, we examine one specific method to accomplish that: prefetching. We explain the circumstances under which prefetching can be expected…
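One common form of in-kernel prefetching is to request the next element before computing on the current one, so the load overlaps with arithmetic instead of stalling the warp. The sketch below illustrates the idea with a simple strided reduction (names and shapes are illustrative, not the post's code):

```cuda
#include <cstdio>

// Each thread walks a strided range of the input. The element for the next
// iteration is requested one step ahead, so its load is in flight while the
// current value is being used.
__global__ void sum_squares_prefetch(const float* __restrict__ in,
                                     float* __restrict__ out, int n) {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc  = 0.0f;

    int i   = tid;
    float v = (i < n) ? in[i] : 0.0f;                 // prime the pipeline
    while (i < n) {
        int next     = i + stride;
        float v_next = (next < n) ? in[next] : 0.0f;  // prefetch next element
        acc += v * v;                                 // compute on current one
        v = v_next;
        i = next;
    }
    if (tid < n) out[tid] = acc;
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    sum_squares_prefetch<<<256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```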
“Truth is much too complicated to allow anything but approximations.” – John von Neumann. The history of computing has demonstrated that there is no limit to what can be achieved with the relatively simple arithmetic implemented in computer hardware. But the “truth” that computers represent using finite-size numbers is fundamentally approximate. As David Goldberg wrote…
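A small example of that fundamental approximation: the decimal values 0.1, 0.2, and 0.3 have no exact binary floating-point representation, so even their sum is only approximate.

```cpp
#include <cstdio>

int main() {
    // 0.1, 0.2, and 0.3 are not exactly representable in binary floating
    // point, so the sum of the first two is not bit-identical to the third.
    double a = 0.1, b = 0.2, c = 0.3;
    printf("0.1 + 0.2 == 0.3 ? %s\n", (a + b == c) ? "true" : "false");
    printf("0.1 + 0.2 = %.17g\n", a + b);   // 0.30000000000000004
    printf("0.3       = %.17g\n", c);       // 0.29999999999999999
    return 0;
}
```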
When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrames rather than modifying them in place. The overhead of cudaMalloc and the synchronization of cudaFree were holding RAPIDS back. My first task for RAPIDS was to help with this problem, so I created a rough…
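RMM addresses this with pooled suballocation. As a rough illustration of the same idea using only the CUDA runtime's stream-ordered allocator (not RMM's own API), repeated allocations below are served from a pool after warm-up, and frees no longer force a device-wide synchronization:

```cuda
// High-frequency allocate/free on a stream-ordered memory pool: after the
// pool warms up, blocks are recycled without calling into the expensive
// device-wide allocator, and freeing does not synchronize the device.
int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 1000; ++iter) {
        void* buf = nullptr;
        cudaMallocAsync(&buf, 1 << 20, stream);  // suballocated from a pool
        // ... launch kernels on `stream` that use `buf` ...
        cudaFreeAsync(buf, stream);              // returned to the pool
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```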
Linear solvers are probably the most common tool in scientific computing applications. There are two basic classes of methods that can be used to solve a linear system: direct and iterative. Direct methods are usually robust, but have higher computational complexity and memory capacity requirements. Unlike direct solvers, iterative solvers require minimal memory overhead and feature better…
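A minimal example of an iterative method is Jacobi iteration, sketched below on a small, made-up diagonally dominant system. Each sweep needs only one extra vector, and every component update is independent, which is why such methods map well to GPUs:

```cpp
#include <cstdio>
#include <vector>

// Jacobi iteration for A x = b: each new component uses only values from the
// previous sweep, so the memory overhead is a single extra vector.
int main() {
    const int n = 3;
    double A[3][3] = {{4, -1, 0}, {-1, 4, -1}, {0, -1, 4}};  // example system
    double b[3]    = {15, 10, 10};

    std::vector<double> x(n, 0.0), x_new(n, 0.0);
    for (int iter = 0; iter < 100; ++iter) {
        for (int i = 0; i < n; ++i) {
            double sigma = 0.0;
            for (int j = 0; j < n; ++j)
                if (j != i) sigma += A[i][j] * x[j];
            x_new[i] = (b[i] - sigma) / A[i][i];
        }
        x.swap(x_new);
    }
    printf("x = (%.4f, %.4f, %.4f)\n", x[0], x[1], x[2]);
    return 0;
}
```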
Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” of the 51 accepted at Euro-Par 2015. Euro-Par is a European conference devoted to all aspects of parallel and distributed processing, held August 24–28 at Austria’s Vienna University of Technology. Leyuan’s paper, Fast Parallel Suffix Array on the GPU, co-authored by her…
In this post, we discuss how CUDA has facilitated materials research in the Department of Chemical and Biomolecular Engineering at UC Berkeley and Lawrence Berkeley National Laboratory. This post is a collaboration between Cory Simon, Jihan Kim, Richard L. Martin, Maciej Haranczyk, and Berend Smit. Nanoporous materials have nano-sized pores such that only a few molecules can fit inside.
Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs…
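On current hardware, per-cell sums like these can be expressed with atomic adds. The sketch below is illustrative only (the cell, value, and cell_sum arrays are hypothetical names, not the MPC code):

```cuda
// One thread per particle: each adds its value into the running total of the
// cell it belongs to. Atomics serialize only the colliding updates.
__global__ void accumulate_cells(const int* __restrict__ cell,
                                 const float* __restrict__ value,
                                 float* __restrict__ cell_sum,
                                 int num_particles) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_particles) {
        atomicAdd(&cell_sum[cell[i]], value[i]);
    }
}

int main() {
    const int num_particles = 1 << 20, num_cells = 4096;
    int* cell;
    float *value, *cell_sum;
    cudaMalloc(&cell, num_particles * sizeof(int));
    cudaMalloc(&value, num_particles * sizeof(float));
    cudaMalloc(&cell_sum, num_cells * sizeof(float));
    cudaMemset(cell, 0, num_particles * sizeof(int));    // demo: all in cell 0
    cudaMemset(value, 0, num_particles * sizeof(float));
    cudaMemset(cell_sum, 0, num_cells * sizeof(float));

    accumulate_cells<<<(num_particles + 255) / 256, 256>>>(cell, value,
                                                           cell_sum,
                                                           num_particles);
    cudaDeviceSynchronize();
    cudaFree(cell);
    cudaFree(value);
    cudaFree(cell_sum);
    return 0;
}
```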
Deep learning has made enormous leaps forward thanks to GPU hardware. But much Big Data analysis is still done with classical methods on sparse data. Tasks like click prediction, personalization, recommendation, search ranking, etc. still account for most of the revenue from commercial data analysis. The role of GPUs in that realm has been less clear. In the BIDMach project (part of the BID Data…
OpenACC is a high-level programming model for accelerating applications with GPUs and other devices. It uses compiler directives to specify loops and regions of code in standard C, C++, and Fortran to offload from a host CPU to an attached accelerator, which simplifies accelerating applications with GPUs. OpenACC tutorial: Three Steps to More Science. An often-overlooked…
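A minimal OpenACC example in that spirit: a SAXPY loop offloaded with a single directive (sizes and data clauses here are illustrative; it needs an OpenACC-capable compiler such as nvc++ with -acc):

```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    const int n = 1 << 20;
    float* x = (float*)malloc(n * sizeof(float));
    float* y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    const float a = 3.0f;

    // One directive offloads the loop: the compiler generates the device
    // kernel, and the copyin/copy clauses describe the data movement.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);  // expected: 5.0
    free(x);
    free(y);
    return 0;
}
```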