Tensor Cores, which are programmable matrix multiply-and-accumulate units, were first introduced in the V100 GPUs, where they operated on half-precision (16-bit) multiplicands. Tensor Core functionality has been expanded in subsequent architectures, and the Ampere A100 GPUs (compute capability 8.0) added support for additional data types, including double precision.
We’ve all been there. Your CUDA Fortran code is humming along and suddenly you get a runtime error, usually accompanied by a message in all caps. In many cases, the error message gives you enough information to find where the problem is in your source code: perhaps you only perform a few host-to-device transfers, or your code ran fine before you added that block of code earlier…
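When the message alone is not enough, checking status codes explicitly narrows things down quickly. A minimal sketch of that style of checking (the allocation size and messages are illustrative, not taken from the post):

    program errorCheck
      use cudafor
      implicit none
      real, device, allocatable :: a_d(:)
      integer :: istat

      ! Check the status of device allocations explicitly rather than
      ! letting the runtime abort with a terse message.
      allocate(a_d(1024), stat=istat)
      if (istat /= 0) write(*,*) 'Device allocation failed, stat = ', istat

      ! cudaGetLastError() reports errors from kernel launches;
      ! cudaDeviceSynchronize() returns errors that occur during execution.
      istat = cudaGetLastError()
      if (istat /= cudaSuccess) &
           write(*,*) 'Launch error: ', trim(cudaGetErrorString(istat))

      istat = cudaDeviceSynchronize()
      if (istat /= cudaSuccess) &
           write(*,*) 'Execution error: ', trim(cudaGetErrorString(istat))
    end program errorCheck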
When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM).
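As a rough sketch of the manual alternative mentioned above (one small GEMM per stream, not the batched routine itself), something like the following could be used; it assumes the handle-based interfaces of the cublas_v2 module, and the matrix sizes, batch count, and names are purely illustrative:

    program streamedGemms
      use cudafor
      use cublas_v2
      implicit none
      integer, parameter :: n = 32, nbatch = 8
      real, device :: A_d(n,n,nbatch), B_d(n,n,nbatch), C_d(n,n,nbatch)
      integer(kind=cuda_stream_kind) :: stream(nbatch)
      type(cublasHandle) :: h
      integer :: i, istat

      A_d = 1.0; B_d = 2.0; C_d = 0.0
      istat = cublasCreate(h)

      ! One independent SGEMM per stream; a single call to the batched
      ! routine accomplishes the same thing for many small matrices.
      do i = 1, nbatch
         istat = cudaStreamCreate(stream(i))
         istat = cublasSetStream(h, stream(i))
         istat = cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &
              1.0, A_d(:,:,i), n, B_d(:,:,i), n, 0.0, C_d(:,:,i), n)
      end do
      istat = cudaDeviceSynchronize()
      istat = cublasDestroy(h)
    end program streamedGemms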
This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we extend the matrix transpose example from a previous post to operate on a matrix that is distributed across multiple GPUs. The data layout is shown in Figure 1 for a 1024 × 768 element matrix that is distributed amongst four devices.
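A minimal sketch of how such a distributed layout might be set up, using a hypothetical deviceArray derived type (not from the book) to hold each device's slab of the matrix:

    program distributedLayout
      use cudafor
      implicit none
      ! Hypothetical container for one device's portion of the matrix
      type deviceArray
         real, device, allocatable :: v(:,:)
      end type deviceArray

      integer, parameter :: nx = 1024, ny = 768
      type(deviceArray), allocatable :: d_in(:)
      integer :: nDevices, i, istat

      istat = cudaGetDeviceCount(nDevices)
      allocate(d_in(0:nDevices-1))

      ! Each device holds all nx rows and ny/nDevices columns of the
      ! matrix (assuming ny divides evenly among the devices).
      do i = 0, nDevices-1
         istat = cudaSetDevice(i)
         allocate(d_in(i)%v(nx, ny/nDevices))
      end do
    end program distributedLayout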
In the last CUDA Fortran post we dove into 3D finite difference computations in CUDA Fortran, demonstrating how to implement the x derivative part of the computation. In this post, let’s continue by exploring how we can write efficient kernels for the y and z derivatives. As with the previous post, code for the examples in this post is available for download on GitHub. We can easily modify…
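As a much-simplified sketch of the idea for the y derivative (second-order accurate and 2D here, with illustrative sizes and names rather than the higher-order 3D kernels from the post), each thread block covers a narrow tile of x values across the full y extent, so loads stay coalesced in x while the derivative is taken along y:

    module deriv_y_m
      use cudafor
      implicit none
      integer, parameter :: mx = 64, my = 64, tx = 4   ! illustrative sizes
    contains
      ! Launch with blocks of dim3(tx, my, 1) and a grid of dim3(mx/tx, 1, 1)
      attributes(global) subroutine deriv_y(f, df, dy)
        real, intent(in)  :: f(mx,my)
        real, intent(out) :: df(mx,my)
        real, value       :: dy
        real, shared      :: f_s(tx, 0:my+1)   ! tile with a halo in y
        integer :: i, j, il

        il = threadIdx%x                       ! local x index within the tile
        i  = (blockIdx%x-1)*tx + il
        j  = threadIdx%y

        f_s(il, j) = f(i,j)
        if (j == 1)  f_s(il, 0)    = f(i,my)   ! periodic halo cells
        if (j == my) f_s(il, my+1) = f(i,1)
        call syncthreads()

        df(i,j) = (f_s(il,j+1) - f_s(il,j-1)) / (2.0*dy)
      end subroutine deriv_y
    end module deriv_y_m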
In the last CUDA Fortran post we investigated how shared memory can be used to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today’s post is to show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory…
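A heavily simplified sketch of the pencil idea (again second-order and 2D, purely for illustration): each thread block loads one x line plus halo points into shared memory, so neighboring threads reuse data that only had to be read from global memory once:

    module deriv_x_m
      use cudafor
      implicit none
      integer, parameter :: mx = 64, my = 64   ! illustrative sizes
    contains
      ! Launch with blocks of dim3(mx, 1, 1) and a grid of dim3(my, 1, 1)
      attributes(global) subroutine deriv_x(f, df, dx)
        real, intent(in)  :: f(mx,my)
        real, intent(out) :: df(mx,my)
        real, value       :: dx
        real, shared      :: f_s(0:mx+1)       ! one x line plus halo
        integer :: i, j

        i = threadIdx%x
        j = blockIdx%x

        f_s(i) = f(i,j)
        if (i == 1)  f_s(0)    = f(mx,j)       ! periodic halo cells
        if (i == mx) f_s(mx+1) = f(1,j)
        call syncthreads()

        ! each value loaded into shared memory is read by two threads
        df(i,j) = (f_s(i+1) - f_s(i-1)) / (2.0*dx)
      end subroutine deriv_x
    end module deriv_x_m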
My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose…
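A sketch along the lines of the tiled transpose developed in that post (tile sizes are illustrative, and the full version also pads the shared memory tile to avoid bank conflicts):

    module transpose_m
      use cudafor
      implicit none
      integer, parameter :: TILE_DIM = 32, BLOCK_ROWS = 8
      integer, parameter :: nx = 1024, ny = 1024
    contains
      ! Launch with blocks of dim3(TILE_DIM, BLOCK_ROWS, 1) and a grid of
      ! dim3(nx/TILE_DIM, ny/TILE_DIM, 1).
      attributes(global) subroutine transposeCoalesced(odata, idata)
        real, intent(out) :: odata(ny,nx)
        real, intent(in)  :: idata(nx,ny)
        real, shared :: tile(TILE_DIM, TILE_DIM)
        integer :: x, y, j

        ! read a tile from idata with coalesced (contiguous in x) accesses
        x = (blockIdx%x-1)*TILE_DIM + threadIdx%x
        y = (blockIdx%y-1)*TILE_DIM + threadIdx%y
        do j = 0, TILE_DIM-1, BLOCK_ROWS
           tile(threadIdx%x, threadIdx%y+j) = idata(x, y+j)
        end do

        call syncthreads()

        ! write the transposed tile to odata, again with coalesced accesses
        x = (blockIdx%y-1)*TILE_DIM + threadIdx%x
        y = (blockIdx%x-1)*TILE_DIM + threadIdx%y
        do j = 0, TILE_DIM-1, BLOCK_ROWS
           odata(x, y+j) = tile(threadIdx%y+j, threadIdx%x)
        end do
      end subroutine transposeCoalesced
    end module transpose_m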
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of…
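The cost of strided access can be measured with a kernel as simple as the following sketch (module and kernel names are illustrative), which touches every s-th element of an array:

    module stride_m
      use cudafor
      implicit none
    contains
      ! Accesses elements of a with a stride of s; effective bandwidth
      ! drops quickly as s grows because each warp touches more memory
      ! transactions than it actually uses.
      attributes(global) subroutine stride(a, s)
        real :: a(*)
        integer, value :: s
        integer :: i
        i = ((blockIdx%x-1)*blockDim%x + threadIdx%x - 1) * s + 1
        a(i) = a(i) + 1
      end subroutine stride
    end module stride_m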
In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA Fortran series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…
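A small sketch of how those memory spaces appear in CUDA Fortran declarations (all names and sizes are illustrative; launch with 256 threads per block):

    module memory_kinds_m
      use cudafor
      implicit none
      real, constant :: coef           ! constant memory, assigned from the host
      real, device   :: gdata(1024)    ! global memory with module scope
    contains
      attributes(global) subroutine scale(a, n)
        real :: a(*)                   ! global memory passed from the host
        integer, value :: n
        real, shared :: tile(256)      ! shared memory, one copy per thread block
        integer :: i                   ! local variable, typically a register

        i = (blockIdx%x-1)*blockDim%x + threadIdx%x
        if (i <= n) then
           tile(threadIdx%x) = a(i)    ! stage the value in shared memory
           a(i) = coef * tile(threadIdx%x)
        end if
      end subroutine scale
    end module memory_kinds_m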
In my previous CUDA Fortran post I discussed how to transfer data efficiently between the host and device. In this post, I discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let’s…
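A minimal sketch of the pattern (array sizes, the kernel, and names are placeholders): each stream gets its own chunk of the transfer and its own kernel launch, so copies and kernels from different streams can overlap on hardware that supports it:

    module process_m
      use cudafor
      implicit none
    contains
      attributes(global) subroutine process(a, offset)
        real :: a(*)
        integer, value :: offset
        integer :: i
        i = offset + (blockIdx%x-1)*blockDim%x + threadIdx%x
        a(i) = a(i) + 1.0
      end subroutine process
    end module process_m

    program overlapSketch
      use cudafor
      use process_m
      implicit none
      integer, parameter :: n = 4*1024*1024, nStreams = 4
      integer, parameter :: streamSize = n/nStreams, blockSize = 256
      real, pinned, allocatable :: a(:)     ! pinned host memory, needed for async copies
      real, device :: a_d(n)
      integer(kind=cuda_stream_kind) :: stream(nStreams)
      integer :: i, offset, istat

      allocate(a(n)); a = 0.0

      do i = 1, nStreams
         istat = cudaStreamCreate(stream(i))
      end do

      ! Issue each chunk's H2D copy, kernel, and D2H copy in its own stream
      do i = 1, nStreams
         offset = (i-1)*streamSize
         istat = cudaMemcpyAsync(a_d(offset+1), a(offset+1), streamSize, stream(i))
         call process<<<streamSize/blockSize, blockSize, 0, stream(i)>>>(a_d, offset)
         istat = cudaMemcpyAsync(a(offset+1), a_d(offset+1), streamSize, stream(i))
      end do
      istat = cudaDeviceSynchronize()
    end program overlapSketch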
In the previous three posts of this CUDA Fortran series we laid the groundwork for the major thrust of the series: how to optimize CUDA Fortran code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
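A sketch of the pageable-versus-pinned idea: the pinned attribute requests page-locked host memory, which typically raises host-device transfer bandwidth (the array size is illustrative):

    program pinnedTransfer
      use cudafor
      implicit none
      integer, parameter :: n = 4*1024*1024
      real, allocatable         :: a_pageable(:)   ! ordinary (pageable) host memory
      real, pinned, allocatable :: a_pinned(:)     ! page-locked host memory
      real, device              :: a_d(n)
      logical :: pinnedFlag
      integer :: istat

      allocate(a_pageable(n))
      ! The pinned attribute is a request; check whether it was honored.
      allocate(a_pinned(n), STAT=istat, PINNED=pinnedFlag)
      if (.not. pinnedFlag) write(*,*) 'Pinned allocation fell back to pageable memory'

      a_pageable = 1.0;  a_pinned = 1.0

      a_d = a_pageable     ! transfer from pageable host memory
      a_d = a_pinned       ! transfer from pinned host memory (usually faster)
    end program pinnedTransfer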
In this third post of the CUDA Fortran series we discuss various characteristics of the wide range of CUDA-capable GPUs, how to query device properties from within a CUDA Fortran program, and how to handle errors. In our last post, about performance metrics, we discussed how to compute the theoretical peak bandwidth of a GPU. This calculation used the GPU’s memory clock rate and bus interface…
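A sketch of such a query, computing the theoretical peak bandwidth from the memory clock rate and bus width reported by cudaGetDeviceProperties (the factor of 2 accounts for double-data-rate memory):

    program deviceQuery
      use cudafor
      implicit none
      type(cudaDeviceProp) :: prop
      integer :: nDevices, i, istat

      istat = cudaGetDeviceCount(nDevices)
      do i = 0, nDevices-1
         istat = cudaGetDeviceProperties(prop, i)
         write(*,"(' Device Number: ',i0)") i
         write(*,"('   Device name: ',a)") trim(prop%name)
         write(*,"('   Memory Clock Rate (KHz): ', i0)") prop%memoryClockRate
         write(*,"('   Memory Bus Width (bits): ', i0)") prop%memoryBusWidth
         write(*,"('   Peak Memory Bandwidth (GB/s): ', f9.2)") &
              2.0 * prop%memoryClockRate * (prop%memoryBusWidth/8) / 10.0**6
      end do
    end program deviceQuery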
In the first post of this series we looked at the basic elements of CUDA Fortran by examining a CUDA Fortran implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA Fortran codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.
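One of those techniques is timing with CUDA events; a minimal sketch that times a host-to-device transfer and converts the result to an effective bandwidth (sizes are illustrative):

    program eventTiming
      use cudafor
      implicit none
      integer, parameter :: n = 4*1024*1024
      real :: a(n), time
      real, device :: a_d(n)
      type(cudaEvent) :: startEvent, stopEvent
      integer :: istat

      a = 1.0
      istat = cudaEventCreate(startEvent)
      istat = cudaEventCreate(stopEvent)

      istat = cudaEventRecord(startEvent, 0)
      a_d = a                                   ! the operation being timed
      istat = cudaEventRecord(stopEvent, 0)
      istat = cudaEventSynchronize(stopEvent)

      istat = cudaEventElapsedTime(time, startEvent, stopEvent)
      ! 4 bytes per element; cudaEventElapsedTime returns milliseconds
      write(*,*) 'Effective bandwidth (GB/s): ', 4*n/time/1.0e6
    end program eventTiming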
This post is the first in a series on CUDA Fortran, which is the Fortran interface to the CUDA parallel computing platform. If you are familiar with CUDA C, then you are already well on your way to using CUDA Fortran as it is based on the CUDA C runtime API. There are a few differences in how CUDA concepts are expressed using Fortran 90 constructs, but the programming model for both CUDA Fortran…
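For reference, a compact CUDA Fortran SAXPY along the lines of the one examined in that first post:

    module mathOps
    contains
      attributes(global) subroutine saxpy(x, y, a)
        implicit none
        real :: x(:), y(:)
        real, value :: a
        integer :: i, n
        n = size(x)
        i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
        if (i <= n) y(i) = y(i) + a*x(i)
      end subroutine saxpy
    end module mathOps

    program testSaxpy
      use mathOps
      use cudafor
      implicit none
      integer, parameter :: N = 40000
      real :: x(N), y(N), a
      real, device :: x_d(N), y_d(N)
      type(dim3) :: grid, tBlock

      tBlock = dim3(256,1,1)
      grid = dim3(ceiling(real(N)/tBlock%x),1,1)

      x = 1.0; y = 2.0; a = 2.0
      x_d = x                                  ! host-to-device copies by assignment
      y_d = y
      call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
      y = y_d                                  ! device-to-host copy
      write(*,*) 'Max error: ', maxval(abs(y-4.0))
    end program testSaxpy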