In the last CUDA Fortran post we investigated how shared memory can be used to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today��s post is to show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory��
]]>