An Efficient Matrix Transpose in CUDA Fortran
NVIDIA Technical Blog
Greg Ruetsch
Published 2013-02-07T19:42:42Z, updated 2022-08-21T23:36:48Z
http://www.parallelforall.com/?p=579

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...

My previous CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post, I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose...
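
To make the idea of reordering accesses through shared memory concrete, here is a minimal sketch, not necessarily the code from the full post. Each thread block reads a tile of the input with contiguous (coalesced) accesses, stores it in shared memory, and then writes the tile back out so that the writes to global memory are contiguous as well. The kernel name `transposeTiled`, the parameters `TILE_DIM` and `BLOCK_ROWS`, and the matrix sizes `nx` and `ny` are assumptions chosen for illustration, and the sketch assumes both sizes are multiples of `TILE_DIM`.

```fortran
module transpose_sketch
  use cudafor
  implicit none
  ! Hypothetical tile and matrix sizes, chosen only for illustration.
  integer, parameter :: TILE_DIM = 32, BLOCK_ROWS = 8
  integer, parameter :: nx = 256, ny = 128
contains
  attributes(global) subroutine transposeTiled(odata, idata)
    real, intent(out) :: odata(ny, nx)
    real, intent(in)  :: idata(nx, ny)
    real, shared :: tile(TILE_DIM, TILE_DIM)
    integer :: x, y, j

    ! Read phase: consecutive threadIdx%x values read consecutive
    ! elements of a column of idata, so the loads are coalesced.
    x = (blockIdx%x - 1) * TILE_DIM + threadIdx%x
    y = (blockIdx%y - 1) * TILE_DIM + threadIdx%y
    do j = 0, TILE_DIM - 1, BLOCK_ROWS
      tile(threadIdx%x, threadIdx%y + j) = idata(x, y + j)
    end do

    ! Every thread must finish writing the tile before any thread
    ! reads an element that a different thread stored.
    call syncthreads()

    ! Write phase: swap the block indices so that consecutive
    ! threadIdx%x values write consecutive elements of odata.
    ! The strided access now happens in shared memory instead.
    x = (blockIdx%y - 1) * TILE_DIM + threadIdx%x
    y = (blockIdx%x - 1) * TILE_DIM + threadIdx%y
    do j = 0, TILE_DIM - 1, BLOCK_ROWS
      odata(x, y + j) = tile(threadIdx%y + j, threadIdx%x)
    end do
  end subroutine transposeTiled
end module transpose_sketch

program main
  use cudafor
  use transpose_sketch
  implicit none
  real :: h_in(nx, ny), h_out(ny, nx)
  real, device :: d_in(nx, ny), d_out(ny, nx)
  type(dim3) :: grid, tblock
  integer :: i, j

  ! Fill the input with distinct values that are exact in single precision.
  do j = 1, ny
    do i = 1, nx
      h_in(i, j) = real(i + (j - 1) * nx)
    end do
  end do

  ! One block per tile; each block has TILE_DIM x BLOCK_ROWS threads,
  ! so every thread handles TILE_DIM / BLOCK_ROWS elements.
  grid   = dim3(nx / TILE_DIM, ny / TILE_DIM, 1)
  tblock = dim3(TILE_DIM, BLOCK_ROWS, 1)

  d_in = h_in
  call transposeTiled<<<grid, tblock>>>(d_out, d_in)
  h_out = d_out

  ! Compare against the Fortran intrinsic transpose on the host.
  if (all(h_out == transpose(h_in))) then
    print *, 'transpose OK'
  else
    print *, 'transpose FAILED'
  end if
end program main
```

The sketch is meant to be compiled as CUDA Fortran, for example with `nvfortran` on a source file with the `.cuf` extension. It checks correctness only and does not measure bandwidth, so it says nothing by itself about the size of the performance gain.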
