CUDA Fortran – NVIDIA Technical Blog. News and tutorials for developers, data scientists, and IT admins. 2025-03-24T20:52:54Z http://www.open-lab.net/blog/feed/

Greg Ruetsch <![CDATA[Pro Tip: Pinpointing Runtime Errors in CUDA Fortran]]> http://www.open-lab.net/blog/parallelforall/?p=8590 2022-08-21T23:38:33Z 2017-11-17T02:03:48Z

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.

We've all been there. Your CUDA Fortran code is humming along and suddenly you get a runtime error, usually accompanied by an error name in all caps. In many cases, the error message gives you enough information to find where the problem is in your source code: you have a runtime error and you only perform a few host-to-device transfers, or your code ran fine before you added that block of code earlier…
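
As the posts in this series show, the standard guard is to check the integer status that every CUDA API call returns. A minimal sketch (assuming the NVIDIA/PGI CUDA Fortran compiler and its `cudafor` module):

```fortran
program errcheck
  use cudafor
  implicit none
  integer :: istat
  real :: a(1024)
  real, device :: a_d(1024)

  a = 1.0
  a_d = a                         ! host-to-device copy via assignment

  ! Synchronize so any asynchronous error surfaces here, then decode it.
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) then
     write(*,*) 'Error: ', trim(cudaGetErrorString(istat))
  end if
end program errcheck
```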

Source

Massimiliano Fatica <![CDATA[Customize CUDA Fortran Profiling with NVTX]]> http://www.open-lab.net/blog/parallelforall/?p=5951 2022-08-21T23:37:38Z 2015-09-30T01:53:40Z

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the NVIDIA Visual Profiler (NVVP) and NSight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.
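
From CUDA Fortran, NVTX's C entry points can be reached through `iso_c_binding`. A minimal, hedged sketch of such a wrapper (the module described in the post is richer, with colored ranges and overloads):

```fortran
! Sketch: bind two NVTX C entry points so Fortran code can push/pop
! named ranges that appear in the profiler timeline.
module nvtx
  use iso_c_binding
  interface
     function nvtxRangePushA(name) bind(C, name='nvtxRangePushA') result(level)
       use iso_c_binding
       character(kind=c_char), intent(in) :: name(*)
       integer(c_int) :: level
     end function nvtxRangePushA
     function nvtxRangePop() bind(C, name='nvtxRangePop') result(level)
       use iso_c_binding
       integer(c_int) :: level
     end function nvtxRangePop
  end interface
end module nvtx
```

Link with `-lnvToolsExt` and pass null-terminated names, e.g. `istat = nvtxRangePushA('solver step'//c_null_char)` before the region and `istat = nvtxRangePop()` after it.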

Source

Jeff Larkin http://jefflarkin.com <![CDATA[3 Versatile OpenACC Interoperability Techniques]]> http://www.open-lab.net/blog/parallelforall/?p=3523 2022-08-21T23:37:08Z 2014-09-02T13:00:16Z

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices: compiler directives specify loops and regions of code in standard C, C++, and Fortran to offload from a host CPU to an attached accelerator. OpenACC simplifies accelerating applications with GPUs. An often-overlooked…
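
One of the interoperability techniques covered is handing OpenACC-managed device data to a CUDA library with `host_data use_device`. A sketch, assuming the cuBLAS Fortran interfaces shipped with the NVIDIA HPC SDK:

```fortran
program acc_cublas
  use cublas          ! cuBLAS Fortran interfaces (legacy API, no handle)
  implicit none
  integer, parameter :: n = 4096
  real :: x(n), y(n)

  x = 1.0; y = 2.0
  !$acc data copyin(x) copy(y)
  !$acc host_data use_device(x, y)
  call cublasSaxpy(n, 2.0, x, 1, y, 1)   ! receives device addresses
  !$acc end host_data
  !$acc end data
  print *, 'y(1) = ', y(1)               ! y = 2 + 2*1 = 4
end program acc_cublas
```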

Source

Mark Harris <![CDATA[10 Ways CUDA 6.5 Improves Performance and Productivity]]> http://www.open-lab.net/blog/parallelforall/?p=3436 2022-10-10T18:42:09Z 2014-08-20T13:00:38Z

Today we're excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more. Last year we introduced CUDA on Arm, and in March we released the Jetson TK1 developer board…

Source

Mark Harris <![CDATA[Unified Memory: Now for CUDA Fortran Programmers]]> http://www.open-lab.net/blog/parallelforall/?p=3441 2022-08-21T23:37:07Z 2014-08-13T07:08:26Z

Unified Memory is a CUDA feature that we've talked a lot about on Parallel Forall. CUDA 6 introduced Unified Memory, which dramatically simplifies GPU programming by giving programmers a single pointer to data which is accessible from either the GPU or the CPU. But this enhanced memory model has only been available to CUDA C/C++ programmers, until now. The new PGI Compiler release 14.7…
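
In CUDA Fortran, Unified Memory is exposed through the `managed` variable attribute. A minimal sketch (the `!$cuf kernel do` directive generates the GPU loop; the synchronize before reading on the host is required):

```fortran
program managed_demo
  use cudafor
  implicit none
  integer, parameter :: n = 1024
  real, managed :: a(n)       ! one allocation visible to host and device
  integer :: i, istat

  a = 2.0
  !$cuf kernel do <<<*,*>>>
  do i = 1, n
     a(i) = a(i) * 3.0
  end do
  istat = cudaDeviceSynchronize()   ! finish GPU work before host access
  print *, a(n)                     ! 6.0
end program managed_demo
```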

Source

Greg Ruetsch <![CDATA[CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran]]> http://www.open-lab.net/blog/parallelforall/?p=2672 2022-08-21T23:37:03Z 2014-03-06T04:41:20Z

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM).
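The batched API takes arrays of device pointers, one per matrix in the batch. A hedged sketch of the pattern (exact argument details follow the `cublas` module in the NVIDIA HPC SDK; names here are illustrative):

```fortran
program batched_gemm
  use cudafor
  use cublas
  implicit none
  integer, parameter :: m = 8, nbatch = 1000
  real, device :: A(m,m,nbatch), B(m,m,nbatch), C(m,m,nbatch)
  type(c_devptr)         :: hA(nbatch), hB(nbatch), hC(nbatch)
  type(c_devptr), device :: dA(nbatch), dB(nbatch), dC(nbatch)
  type(cublasHandle) :: h
  integer :: i, istat

  A = 1.0; B = 2.0; C = 0.0
  istat = cublasCreate(h)
  ! Record the device address of each matrix on the host, then copy the
  ! pointer arrays themselves to the device, as the batched API expects.
  do i = 1, nbatch
     hA(i) = c_devloc(A(1,1,i))
     hB(i) = c_devloc(B(1,1,i))
     hC(i) = c_devloc(C(1,1,i))
  end do
  dA = hA; dB = hB; dC = hC
  ! One call launches all nbatch independent GEMMs.
  istat = cublasSgemmBatched(h, CUBLAS_OP_N, CUBLAS_OP_N, m, m, m, &
                             1.0, dA, m, dB, m, 0.0, dC, m, nbatch)
  istat = cublasDestroy(h)
end program batched_gemm
```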

Source

Greg Ruetsch <![CDATA[Peer-to-Peer Multi-GPU Transpose in CUDA Fortran (Book Excerpt)]]> http://www.open-lab.net/blog/parallelforall/?p=2361 2022-08-21T23:36:58Z 2014-01-02T06:19:45Z

This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we extend the matrix transpose example from a previous post to operate on a matrix that is distributed across multiple GPUs. The data layout is shown in Figure 1 for a 1024 × 768 element matrix that is distributed amongst four devices.
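
A prerequisite for direct device-to-device transfers is enabling peer access between the GPUs involved. A minimal sketch for devices 0 and 1:

```fortran
program p2p_sketch
  use cudafor
  implicit none
  integer :: istat, canAccess

  ! Peer access depends on topology, so query before enabling it.
  istat = cudaDeviceCanAccessPeer(canAccess, 0, 1)
  if (canAccess == 1) then
     istat = cudaSetDevice(0)
     istat = cudaDeviceEnablePeerAccess(1, 0)   ! second argument (flags) must be 0
  else
     write(*,*) 'Device 0 cannot access device 1 directly'
  end if
end program p2p_sketch
```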

Source

Greg Ruetsch <![CDATA[Finite Difference Methods in CUDA Fortran, Part 2]]> http://www.parallelforall.com/?p=1177 2022-08-21T23:36:53Z 2013-04-02T01:26:53Z

In the last CUDA Fortran post we dove in to 3D finite difference computations in CUDA Fortran, demonstrating how to implement the x derivative part of the computation. In this post, let's continue by exploring how we can write efficient kernels for the y and z derivatives. As with the previous post, code for the examples in this post is available for download on GitHub. We can easily modify…

Source

Greg Ruetsch <![CDATA[Finite Difference Methods in CUDA Fortran, Part 1]]> http://www.parallelforall.com/?p=713 2022-08-21T23:36:49Z 2013-02-26T18:21:48Z

In the last CUDA Fortran post we investigated how shared memory can be used to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today's post is to show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory…

Source

Greg Ruetsch <![CDATA[An Efficient Matrix Transpose in CUDA Fortran]]> http://www.parallelforall.com/?p=579 2022-08-21T23:36:48Z 2013-02-07T19:42:42Z

My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose…
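
A condensed sketch of the shared-memory tile transpose the post develops (one thread per element here; the full kernel uses a TILE_DIM × BLOCK_ROWS block that loops over rows, and the `+1` pad on the tile avoids shared-memory bank conflicts):

```fortran
module transpose_m
contains
  attributes(global) subroutine transposeTiled(odata, idata, nx, ny)
    implicit none
    integer, parameter :: TILE_DIM = 32
    integer, value :: nx, ny
    real, intent(in)  :: idata(nx, ny)
    real, intent(out) :: odata(ny, nx)
    real, shared :: tile(TILE_DIM, TILE_DIM+1)  ! pad avoids bank conflicts
    integer :: x, y

    x = (blockIdx%x - 1)*TILE_DIM + threadIdx%x
    y = (blockIdx%y - 1)*TILE_DIM + threadIdx%y
    tile(threadIdx%x, threadIdx%y) = idata(x, y)   ! coalesced read

    call syncthreads()

    x = (blockIdx%y - 1)*TILE_DIM + threadIdx%x    ! swap block indices
    y = (blockIdx%x - 1)*TILE_DIM + threadIdx%y
    odata(x, y) = tile(threadIdx%y, threadIdx%x)   ! coalesced write
  end subroutine transposeTiled
end module transpose_m
```

Reading a tile into shared memory and writing its transpose back out means both the global read and the global write are contiguous, even though the logical access pattern is strided.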

Source

Greg Ruetsch <![CDATA[Using Shared Memory in CUDA Fortran]]> http://www.parallelforall.com/?p=548 2023-06-12T21:18:21Z 2013-01-15T12:01:23Z

In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of…
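
The effect of stride can be measured with a kernel along these lines (a sketch): consecutive threads touch elements `s` apart, so `s = 1` gives fully coalesced accesses, while larger strides waste more of each memory transaction.

```fortran
module stride_m
contains
  attributes(global) subroutine stride_kernel(a, s)
    implicit none
    real :: a(*)
    integer, value :: s
    integer :: i
    ! Thread t accesses element (t-1)*s + 1: adjacent threads are s apart.
    i = ((blockIdx%x - 1)*blockDim%x + threadIdx%x - 1)*s + 1
    a(i) = a(i) + 1.0
  end subroutine stride_kernel
end module stride_m
```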

Source

Greg Ruetsch <![CDATA[How to Access Global Memory Efficiently in CUDA Fortran Kernels]]> http://www.parallelforall.com/?p=521 2022-08-21T23:36:48Z 2013-01-04T02:16:42Z

In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA Fortran series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…

Source

Greg Ruetsch <![CDATA[How to Overlap Data Transfers in CUDA Fortran]]> http://test.markmark.net/?p=495 2022-08-21T23:36:48Z 2012-12-11T12:35:06Z

In my previous CUDA Fortran post I discussed how to transfer data efficiently between the host and device. In this post, I discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let's…
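
The core of the pattern is a sketch like the following: pinned host memory plus a non-default stream lets the copy proceed while the host (or other streams) does independent work.

```fortran
program overlap_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 1048576
  real, pinned, allocatable :: a(:)   ! async copies require page-locked memory
  real, device :: a_d(n)
  integer(kind=cuda_stream_kind) :: s
  integer :: istat

  allocate(a(n))
  a = 0.0
  istat = cudaStreamCreate(s)
  ! The async copy returns immediately; the transfer proceeds in stream s.
  istat = cudaMemcpyAsync(a_d, a, n, cudaMemcpyHostToDevice, s)
  ! ... independent host work, or kernels in other streams, go here ...
  istat = cudaStreamSynchronize(s)    ! wait before reusing a or a_d
end program overlap_sketch
```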

Source

Greg Ruetsch <![CDATA[How to Optimize Data Transfers in CUDA Fortran]]> http://test.markmark.net/?p=434 2022-08-21T23:36:47Z 2012-11-29T18:08:36Z

In the previous three posts of this CUDA Fortran series we laid the groundwork for the major thrust of the series: how to optimize CUDA Fortran code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050)…
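
One key optimization the post covers is using pinned (page-locked) host memory, which typically raises host-device bandwidth and is required for asynchronous transfers. A minimal sketch:

```fortran
program pinned_demo
  use cudafor
  implicit none
  integer, parameter :: n = 4*1024*1024
  real, pinned, allocatable :: a_pinned(:)
  real, device :: a_d(n)
  integer :: istat
  logical :: flag

  ! The PINNED= specifier reports whether page-locking actually succeeded;
  ! on failure the allocation silently falls back to pageable memory.
  allocate(a_pinned(n), STAT=istat, PINNED=flag)
  if (.not. flag) write(*,*) 'Pinned allocation failed; memory is pageable'

  a_pinned = 1.0
  a_d = a_pinned          ! transfer from page-locked host memory
end program pinned_demo
```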

Source

Greg Ruetsch <![CDATA[How to Query Device Properties and Handle Errors in CUDA Fortran]]> http://test.markmark.net/?p=302 2022-08-21T23:36:47Z 2012-11-15T02:43:28Z

In this third post of the CUDA Fortran series we discuss various characteristics of the wide range of CUDA-capable GPUs, how to query device properties from within a CUDA Fortran program, and how to handle errors. In our last post, about performance metrics, we discussed how to compute the theoretical peak bandwidth of a GPU. This calculation used the GPU's memory clock rate and bus interface…
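
Device properties are queried through the `cudaDeviceProp` derived type. A minimal sketch that lists each attached device:

```fortran
program device_query
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat, i, ndev

  istat = cudaGetDeviceCount(ndev)
  do i = 0, ndev - 1          ! CUDA devices are numbered from 0
     istat = cudaGetDeviceProperties(prop, i)
     write(*,'(A,I0,A,A)') 'Device ', i, ': ', trim(prop%name)
  end do
end program device_query
```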

Source

Greg Ruetsch <![CDATA[How to Implement Performance Metrics in CUDA Fortran]]> http://test.markmark.net/?p=288 2022-08-21T23:36:47Z 2012-11-05T18:41:03Z

In the first post of this series we looked at the basic elements of CUDA Fortran by examining a CUDA Fortran implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA Fortran codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.
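
The timing technique the post describes uses CUDA events, which time GPU operations without the pitfalls of host timers around asynchronous calls. A sketch:

```fortran
program timing_sketch
  use cudafor
  implicit none
  integer, parameter :: n = 1048576
  type(cudaEvent) :: startEvent, stopEvent
  real :: a(n), time_ms
  real, device :: a_d(n)
  integer :: istat

  a = 1.0
  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  istat = cudaEventRecord(startEvent, 0)
  a_d = a                               ! the operation being timed
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)   ! wait for the stop event
  istat = cudaEventElapsedTime(time_ms, startEvent, stopEvent)

  write(*,*) 'Transfer time (ms): ', time_ms
end program timing_sketch
```

Effective bandwidth then follows as bytes moved divided by elapsed time, e.g. `4.0*n / time_ms * 1.0e-6` GB/s for this single-precision transfer.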

Source

Greg Ruetsch <![CDATA[An Easy Introduction to CUDA Fortran]]> http://test.markmark.net/?p=260 2022-08-21T23:36:47Z 2012-10-30T05:07:12Z

This post is the first in a series on CUDA Fortran, which is the Fortran interface to the CUDA parallel computing platform. If you are familiar with CUDA C, then you are already well on your way to using CUDA Fortran as it is based on the CUDA C runtime API. There are a few differences in how CUDA concepts are expressed using Fortran 90 constructs, but the programming model for both CUDA Fortran…
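
The post builds around a SAXPY example along these lines (a reconstructed sketch): a `global` kernel in a module, `device` arrays transferred by assignment, and a chevron launch.

```fortran
module mathOps
contains
  attributes(global) subroutine saxpy(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy
end module mathOps

program testSaxpy
  use mathOps
  use cudafor
  implicit none
  integer, parameter :: N = 40000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)
  type(dim3) :: grid, tBlock

  tBlock = dim3(256, 1, 1)
  grid = dim3(ceiling(real(N)/tBlock%x), 1, 1)

  x = 1.0; y = 2.0; a = 2.0
  x_d = x; y_d = y                       ! host-to-device via assignment
  call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
  y = y_d                                ! device-to-host via assignment
  write(*,*) 'Max error: ', maxval(abs(y - 4.0))
end program testSaxpy
```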

Source
