An Efficient Matrix Transpose in CUDA C/C++ – NVIDIA Technical Blog
Mark Harris
Published 2013-02-19T04:49:19Z, updated 2022-08-21T23:36:51Z
http://www.parallelforall.com/?p=1166

My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose of a…
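To make the idea of reordering strided accesses into coalesced ones concrete, here is a minimal sketch of a tiled transpose. It is an illustration under stated assumptions, not necessarily the code the full post develops: the names `transposeTiled` and `countTransposeMismatches`, the tile size of 32, the 32x8 thread block, the extra padding column, and the restriction to square matrices whose size is a multiple of the tile size are all choices made for this example.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumed for this sketch: 32x32 tiles handled by 32x8 thread blocks,
// so each thread copies TILE_DIM / BLOCK_ROWS = 4 elements.
constexpr int TILE_DIM = 32;
constexpr int BLOCK_ROWS = 8;

// Hypothetical kernel: out = transpose(in) for an n x n row-major matrix.
// Requires n to be a multiple of TILE_DIM.
__global__ void transposeTiled(float *out, const float *in, int n)
{
    // The extra column is padding, a common way to reduce shared memory
    // bank conflicts when the tile is later read column-wise.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;

    // Coalesced read: consecutive threadIdx.x values read consecutive addresses.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * n + x];

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    // Swap the block indices so the output tile lands in the transposed position.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    // Coalesced write: the strided access now happens in shared memory instead.
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        out[(y + j) * n + x] = tile[threadIdx.x][threadIdx.y + j];
}

// Hypothetical host helper: runs the kernel on an n x n matrix and returns the
// number of elements that differ from a CPU transpose, or -1 on a CUDA error.
int countTransposeMismatches(int n)
{
    assert(n > 0 && n % TILE_DIM == 0);
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(float);

    std::vector<float> h_in(count), h_out(count, 0.0f);
    for (size_t i = 0; i < count; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 grid(n / TILE_DIM, n / TILE_DIM);
        dim3 block(TILE_DIM, BLOCK_ROWS);
        transposeTiled<<<grid, block>>>(d_out, d_in, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (h_out[static_cast<size_t>(r) * n + c] !=
                h_in[static_cast<size_t>(c) * n + r])
                ++mismatches;
    return mismatches;
}
```

Calling `countTransposeMismatches(1024)` on a machine with a CUDA-capable GPU should return 0 if the kernel is correct. The key design point is that both the global read and the global write walk along rows, so neighboring threads touch neighboring addresses; the column-wise (strided) access that a transpose inherently needs is confined to the shared memory tile.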
