An Efficient Matrix Transpose in CUDA Fortran – NVIDIA Technical Blog

An Efficient Matrix Transpose in CUDA Fortran – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-03-26T22:01:23Z http://www.open-lab.net/blog/feed/ Greg Ruetsch <![CDATA[An Efficient Matrix Transpose in CUDA Fortran]]> http://www.parallelforall.com/?p=579 2022-08-21T23:36:48Z 2013-02-07T19:42:42Z

[caption id="attachment_8972" align="alignright" width="242"] CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...]]>

[caption id="attachment_8972" align="alignright" width="242"] CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...

cuda_fortran_simple

My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose��

]]> 2 ��˳��97caoporen��