In the previous two posts, we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA Fortran series, we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…
]]>