Rob Van der Wijngaart – NVIDIA Technical Blog
News and tutorials for developers, data scientists, and IT admins
Feed: http://www.open-lab.net/blog/feed/

Improving GPU Performance by Reducing Instruction Cache Misses
http://www.open-lab.net/blog/?p=86868 | Published: 2024-08-08 | Updated: 2025-01-22

GPUs are specially designed to crunch through massive amounts of data at high speed. They have a large number of compute resources, called streaming multiprocessors (SMs), and an array of facilities to keep them fed with data: high bandwidth to memory, sizable data caches, and the capability to switch to other teams of workers (warps) without any overhead if an active team has run out of data.
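The zero-overhead warp switching described above only helps if the GPU has plenty of warps to switch between. A minimal sketch (not code from the post) of how a typical launch configuration supplies far more warps than there are SMs:

```cuda
// Hypothetical sketch: launching many more warps than the GPU has SMs,
// so the scheduler can switch to ready warps while others wait on memory.
#include <cuda_runtime.h>

__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // warps stalled on the loads yield to ready ones
}

int main() {
    int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // 256 threads per block over 16M elements yields tens of thousands of
    // warps, far more than the SM count, giving the hardware plenty of
    // candidates for zero-overhead switching.
    int block = 256, grid = (n + block - 1) / block;
    axpy<<<grid, block>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```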

Measuring the GPU Occupancy of Multi-stream Workloads
http://www.open-lab.net/blog/?p=81074 | Published: 2024-04-19 | Updated: 2025-01-03

NVIDIA GPUs are becoming increasingly powerful with each new generation. This increase generally comes in two forms. Each streaming multiprocessor (SM), the workhorse of the GPU, can execute instructions faster and faster, and the memory system can deliver data to the SMs at an ever-increasing pace. At the same time, the number of SMs also typically increases with each generation…
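As a hedged illustration (not the post's actual measurement harness), a multi-stream workload of the kind the post profiles looks like this: several independent kernels, each too small to fill the GPU on its own, launched into separate streams so they can occupy the SMs concurrently.

```cuda
// Hypothetical sketch of a multi-stream workload: independent kernels in
// separate CUDA streams can run concurrently when each uses only part of
// the GPU's SMs, raising overall occupancy.
#include <cuda_runtime.h>

__global__ void scale(float *data, float s, int n) {
    // Grid-stride loop so a deliberately small grid still covers all elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= s;
}

int main() {
    const int kStreams = 4, n = 1 << 20;
    cudaStream_t streams[kStreams];
    float *buf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
    }
    // 32 blocks per kernel leaves most SMs idle; four concurrent streams
    // can fill the gaps and raise aggregate occupancy.
    for (int s = 0; s < kStreams; ++s)
        scale<<<32, 256, 0, streams[s]>>>(buf[s], 2.0f, n);
    cudaDeviceSynchronize();
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}
```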

Boosting Application Performance with GPU Memory Access Tuning
http://www.open-lab.net/blog/?p=47928 | Published: 2022-06-27 | Updated: 2023-06-12

NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, as GPUs also have high memory bandwidth, but sometimes they need the programmer’s help to saturate that bandwidth. In this post, we examine one method to accomplish that and apply it to an example taken from financial computing.
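One common way to help saturate memory bandwidth is to widen memory accesses; as a hedged sketch under that assumption (the kernel names and the financial example's details are not taken from the post):

```cuda
// Hedged sketch: vectorized (wide) loads move the same data in fewer, larger
// memory transactions, which can help saturate DRAM bandwidth.
#include <cuda_runtime.h>

// Narrow version: one 4-byte load and store per thread.
__global__ void add_one_narrow(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Wide version: one 16-byte load and store per thread via float4.
// Requires n to be a multiple of 4 and 16-byte-aligned pointers.
__global__ void add_one_wide(const float4 *in, float4 *out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        v.x += 1.0f; v.y += 1.0f; v.z += 1.0f; v.w += 1.0f;
        out[i] = v;
    }
}
```

The wide kernel issues a quarter as many memory instructions for the same bytes, which reduces pressure on the load/store units and leaves more latency to be hidden by other warps.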

Boosting Application Performance with GPU Memory Prefetching
http://www.open-lab.net/blog/?p=45713 | Published: 2022-03-23 | Updated: 2023-06-12

NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also have high memory bandwidth, but sometimes they need your help to saturate that bandwidth. In this post, we examine one specific method to accomplish that: prefetching. We explain the circumstances under which prefetching can be expected…
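A generic register-prefetching pattern (a sketch of the idea, not the exact kernel from the post) issues the next load before computing on the current value, so the load's latency overlaps with arithmetic:

```cuda
// Hedged sketch of register prefetching: fetch the next element while
// computing on the current one, overlapping load latency with arithmetic.
__global__ void prefetch_sum_squares(const float *in, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (tid >= n) return;
    float acc = 0.0f;
    float cur = in[tid];                 // first fetch
    for (int i = tid + stride; i < n; i += stride) {
        float nxt = in[i];               // prefetch the next element...
        acc += cur * cur;                // ...while computing on the current one
        cur = nxt;
    }
    acc += cur * cur;                    // drain the last prefetched value
    out[tid] = acc;
}
```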

Employing CUDA Graphs in a Dynamic Environment
http://www.open-lab.net/blog/?p=37199 | Published: 2021-11-03 | Updated: 2022-08-21

Many workloads can be sped up greatly by offloading compute-intensive parts onto GPUs. In CUDA terms, this is known as launching kernels. When those kernels are many and of short duration, launch overhead sometimes becomes a problem. One way of reducing that overhead is offered by CUDA Graphs. Graphs work because they combine arbitrary numbers of asynchronous CUDA API calls…
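A minimal sketch of that combining step, using stream capture (the kernel and its contents are illustrative, not from the post): record a burst of short launches once, then replay them all with a single graph launch.

```cuda
// Hedged sketch of CUDA Graphs via stream capture: many short kernel
// launches are captured into one graph, then replayed with one launch,
// amortizing per-kernel launch overhead.
#include <cuda_runtime.h>

__global__ void step(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 16;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int k = 0; k < 20; ++k)                   // many short kernels...
        step<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
    cudaStreamEndCapture(stream, &graph);          // ...captured into one graph
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    cudaGraphLaunch(exec, stream);                 // one launch replays all 20 kernels
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(x);
    return 0;
}
```

Once instantiated, the executable graph can be relaunched repeatedly; only the single `cudaGraphLaunch` call pays launch overhead, not each of the captured kernels.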
