When writing compute shaders, it’s often necessary to communicate values between threads. This is typically done through shared memory. Kepler GPUs introduced shuffle intrinsics, which enable threads of a warp to directly read each other’s registers, avoiding memory access and synchronization. Shared memory is relatively fast but instructions that operate without using memory of any kind are…
]]>