Julien Demouth – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2023-07-05T19:42:57Z http://www.open-lab.net/blog/feed/ Julien Demouth <![CDATA[CUTLASS: Fast Linear Algebra in CUDA C++]]> http://www.open-lab.net/blog/parallelforall/?p=8708 2023-02-13T17:46:48Z 2017-12-06T04:03:29Z Update May 21, 2018: CUTLASS 1.0 is now available as Open Source software at the CUTLASS repository. CUTLASS 1.0 has changed substantially from our preview...]]>

Update May 21, 2018: CUTLASS 1.0 is now available as Open Source software at the CUTLASS repository. CUTLASS 1.0 has changed substantially from our preview release described in the blog post below. We have decomposed the structure of the GEMM computation into deeper, structured primitives for loading data, computing predicate masks, streaming data at each level of the GEMM hierarchy…

Julien Demouth <![CDATA[How We Achieved Record Finance Benchmark Performance on Tesla K80]]> http://www.open-lab.net/blog/parallelforall/?p=4148 2023-07-05T19:42:57Z 2014-12-17T05:33:09Z STAC Research develops financial benchmarks in partnership with leading banks and software or hardware vendors. The STAC-A2 suite of benchmarks aims to...]]>

Julien Demouth <![CDATA[CUDA Pro Tip: Minimize the Tail Effect]]> http://www.open-lab.net/blog/parallelforall/?p=3275 2022-08-21T23:37:05Z 2014-06-04T14:17:42Z When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the...]]>

When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated from the size of the blocks and the amount of…
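The ratio described above can be sketched in a few lines of host-side C++. This is only an illustration of the definition, not code from the original post: the function name `theoretical_occupancy` and its parameters are hypothetical, and the number of blocks resident on an SM is assumed to be known already (for instance from a profiler or an occupancy calculator) rather than derived here.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): Theoretical Occupancy as
// the ratio of threads that may be resident on one SM to the SM's maximum.
// resident_blocks_per_sm is an assumed input; this sketch does not derive it
// from block size or per-block resource usage.
double theoretical_occupancy(int threads_per_block,
                             int resident_blocks_per_sm,
                             int max_threads_per_sm = 2048) {  // 2048 on Kepler
    assert(threads_per_block > 0);
    assert(resident_blocks_per_sm >= 0);
    assert(max_threads_per_sm > 0);

    // Threads that may run concurrently on the SM.
    const int resident_threads = threads_per_block * resident_blocks_per_sm;
    assert(resident_threads <= max_threads_per_sm);

    return static_cast<double>(resident_threads) / max_threads_per_sm;
}
```

For example, under these assumptions, four resident blocks of 256 threads each give 1024 / 2048, so `theoretical_occupancy(256, 4)` evaluates to 0.5.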
