This post was originally published on the RAPIDS AI blog. The RAPIDS Forest Inference Library, affectionately known as FIL, dramatically accelerates inference (prediction) for tree-based models, including gradient-boosted decision tree models (like those from XGBoost and LightGBM) and random forests. (For a deeper dive into the library overall, check out the original FIL blog.
]]>Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort. In fact, the code generated by the compiler is actually faster than the manually-written warp aggregation code. This post is mainly intended for those who want to learn how…
]]>This post concludes an introductory series on CUDA dynamic parallelism. In this post, I finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment. The PANDA work was carried out in the scope of the NVIDIA Application Lab at Jülich. PANDA (anti-Proton ANnihilation at DArmstadt) is a state-of-the-art hadron…
]]>Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to efficiently use the GPU. For applications consisting of “parallel for” loops the bulk parallel model is not too limiting, but some parallel patterns—such as nested parallelism—cannot be…
]]>