We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don't mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It's common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I'll cover a common but subtle bug and a simple rule…
]]>