Profiling – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-03-21T20:30:26Z http://www.open-lab.net/blog/feed/ Ryan Prescott <![CDATA[Advanced API Performance: SetStablePowerState]]> http://www.open-lab.net/blog/?p=48106 2024-08-28T17:45:35Z 2022-06-28T15:00:00Z

This post covers best practices for using SetStablePowerState on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Most modern processors, including GPUs, change processor core and memory clock rates during application execution. These changes can vary performance, introducing errors in measurements and rendering comparisons…
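A minimal sketch of how the call is typically toggled around a profiling session, assuming a D3D12 device has already been created; the function name EnableStableClocksForProfiling is illustrative, and the call generally requires Windows Developer Mode to be enabled.

// Hedged sketch: pin GPU clocks to stable values while capturing measurements.
// Assumes `device` is a valid ID3D12Device; without Windows Developer Mode the
// call typically fails or leads to device removal.
#include <d3d12.h>

void EnableStableClocksForProfiling(ID3D12Device* device, bool profiling)
{
    // TRUE pins core and memory clocks to fixed, reproducible frequencies;
    // FALSE restores normal boost behavior for shipping builds.
    HRESULT hr = device->SetStablePowerState(profiling ? TRUE : FALSE);
    if (FAILED(hr))
    {
        // Developer Mode is probably not enabled; continue with normal clocks.
    }
}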

Source

]]>
14
Chaitrali Joshi <![CDATA[Advanced Kernel Profiling with the Latest Nsight Compute]]> http://www.open-lab.net/blog/?p=43640 2024-08-28T17:46:06Z 2022-01-27T17:58:33Z

NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging through a user interface and a command-line tool. Nsight Compute 2022.1 brings updates to improve data collection modes, enabling new use cases and options for performance profiling. Download Now >> This release of Nsight Compute extends the…

Source

]]>
0
Yaki Tebeka <![CDATA[TensorFlow Performance Logging Plugin nvtx-plugins-tf Goes Public]]> http://www.open-lab.net/blog/?p=15258 2024-08-28T17:58:49Z 2019-07-16T13:00:05Z The new nvtx-plugins-tf library enables users to add performance logging nodes to TensorFlow graphs. (TensorFlow is an open source library widely used for training deep neural network (DNN) models). These nodes log performance data using the NVTX (NVIDIA's Tools Extension) library. The logged performance data can then be viewed in tools such as NVIDIA Nsight Systems and NVIDIA Nsight Compute.

Source

]]>
1
Yaki Tebeka <![CDATA[NVIDIA Nsight Systems Adds Vulkan Support]]> http://www.open-lab.net/blog/?p=14355 2024-08-28T18:25:53Z 2019-04-23T13:00:44Z

Vulkan is a low-overhead, cross-platform 3D graphics and compute API targeting a wide variety of devices, from cloud gaming servers to PCs and embedded platforms. The Khronos Group manages and defines the Vulkan API. NVIDIA Nsight Systems provides developers with a unified timeline view which displays how applications use computer resources. This low-overhead performance analysis tool helps…

Source

]]>
1
Daniel Horowitz <![CDATA[Nsight Systems Exposes New GPU Optimization Opportunities]]> http://www.open-lab.net/blog/?p=10488 2024-08-28T17:59:37Z 2018-05-30T23:07:56Z

As GPU performance steadily ramps up, your application may be overdue for a tune-up to keep pace. Developers have used independent CPU profilers and GPU profilers in search of bottlenecks and optimization opportunities across their disjointed datasets for years. Using these independent tools can result in picking small optimizations based on false positive indicators or missing large opportunities…

Source

]]>
1
Mark Harris <![CDATA[CUDA 8 Features Revealed]]> http://www.open-lab.net/blog/parallelforall/?p=6554 2022-08-21T23:37:50Z 2016-04-05T12:00:11Z

Today I'm excited to announce the general availability of CUDA 8, the latest update to NVIDIA's powerful parallel computing platform and programming model. In this post I'll give a quick overview of the major new features of CUDA 8. To learn more you can watch the recording of my talk from GTC 2016, "CUDA 8 and Beyond". A crucial goal for CUDA 8 is to provide support for the powerful new…

Source

]]>
51
Massimiliano Fatica <![CDATA[Customize CUDA Fortran Profiling with NVTX]]> http://www.open-lab.net/blog/parallelforall/?p=5951 2022-08-21T23:37:38Z 2015-09-30T01:53:40Z

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Source

]]>
4
Swapna Matwankar <![CDATA[CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling]]> http://www.open-lab.net/blog/parallelforall/?p=5840 2022-08-21T23:37:37Z 2015-09-08T07:01:23Z

[Note: Thejaswi Rao also contributed to the code optimizations shown in this post.] Today NVIDIA released CUDA 7.5, the latest release of the powerful CUDA Toolkit. One of the most exciting new features in CUDA 7.5 is new Instruction-Level Profiling support in the NVIDIA Visual Profiler. This powerful new feature, available on Maxwell (GM200) and later GPUs…

Source

]]>
14
Mark Harris <![CDATA[New Features in CUDA 7.5]]> http://www.open-lab.net/blog/parallelforall/?p=5529 2023-02-13T18:15:18Z 2015-07-08T07:01:34Z

Today I'm happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details. CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic…
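As a hedged illustration of what FP16 storage looks like in CUDA C/C++ (the kernel names are placeholders and the host-side driver is omitted): data is kept as __half in memory to cut footprint and bandwidth, and widened to float for arithmetic.

// Illustrative only: store data as 16-bit __half, compute in FP32.
#include <cuda_fp16.h>

__global__ void float_to_half(const float* in, __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __float2half(in[i]);              // FP32 -> FP16 storage
}

__global__ void scale_half(const __half* in, __half* out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __float2half(s * __half2float(in[i]));   // widen, compute, narrow
}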

Source

]]>
66
Jeff Larkin http://jefflarkin.com <![CDATA[GPU Pro Tip: Track MPI Calls In The NVIDIA Visual Profiler]]> http://www.open-lab.net/blog/parallelforall/?p=5177 2022-08-21T23:37:32Z 2015-05-06T02:30:13Z

Often when profiling GPU-accelerated applications that run on clusters, one needs to visualize MPI (Message Passing Interface) calls on the GPU timeline in the profiler. While tools like Vampir and Tau will allow programmers to see a big picture view of how a parallel application performs, sometimes all you need is a look at how MPI is affecting GPU performance on a single node using a simple tool…
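A sketch of the general idea, assuming the standard PMPI profiling interface plus NVTX; the wrapped routine and range name are just one illustrative case, and linking against the MPI library and -lnvToolsExt is assumed (compile as C, or add extern "C" when building as C++).

// Sketch: interpose one MPI call through the PMPI interface and wrap it in an
// NVTX range so it appears on the profiler timeline alongside CUDA work.
#include <mpi.h>
#include <nvToolsExt.h>

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    nvtxRangePushA("MPI_Allreduce");                   // open a named range
    int ret = PMPI_Allreduce(sendbuf, recvbuf, count,  // forward to the real MPI
                             datatype, op, comm);
    nvtxRangePop();                                    // close the range
    return ret;
}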

Source

]]>
2
Mark Ebersole http://www.open-lab.net/blog/parallelforall <![CDATA[Learn GPU Computing with Hands-On Labs at GTC 2015]]> http://www.open-lab.net/blog/parallelforall/?p=4927 2022-08-21T23:37:30Z 2015-02-23T22:59:58Z

Every year NVIDIA's GPU Technology Conference (GTC) gets bigger and better. One of the aims of GTC is to give developers, scientists, and practitioners opportunities to learn with hands-on labs how to use accelerated computing in their work. This year we are nearly doubling the amount of hands-on training provided from last year, with almost 2,400 lab hours available to GTC attendees!

Source

]]>
0
Mark Harris <![CDATA[GPU Pro Tip: CUDA 7 Streams Simplify Concurrency]]> http://www.open-lab.net/blog/parallelforall/?p=4286 2022-08-21T23:37:29Z 2015-01-23T03:46:33Z

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with…
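A minimal, self-contained sketch of the pattern described here, with a placeholder kernel and arbitrary sizes: each stream gets its own copy-compute-copy sequence so transfers and kernels issued to different streams can overlap (pinned host memory is required for asynchronous copies).

// Issue independent chunks of work into separate streams so H2D copies,
// kernels, and D2H copies from different streams can overlap.
#include <cuda_runtime.h>

__global__ void kernel(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]) + 1.0f;
}

int main()
{
    const int nStreams = 4, chunk = 1 << 20;
    float *h, *d;
    cudaMallocHost((void**)&h, nStreams * chunk * sizeof(float));  // pinned host memory
    cudaMalloc((void**)&d, nStreams * chunk * sizeof(float));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        kernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}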

Source

]]>
51
Satish Salian <![CDATA[Remote Application Development using NVIDIA Nsight Eclipse Edition]]> http://www.open-lab.net/blog/parallelforall/?p=3483 2024-08-28T18:00:05Z 2014-08-26T01:11:51Z

NVIDIA Nsight Eclipse Edition (NSEE) is a full-featured unified CPU+GPU integrated development environment (IDE) that lets you easily develop CUDA applications for either your local (x86_64) system or a remote (x86_64 or ARM) target system. In my last post on remote development of CUDA applications, I covered NSEE's cross compilation mode. In this post I will focus on using NSEE's synchronized…

Source

]]>
65
Patric Zhao <![CDATA[Accelerate R Applications with CUDA]]> http://www.open-lab.net/blog/parallelforall/?p=3369 2022-08-21T23:37:06Z 2014-08-05T02:13:24Z

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematics operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the…

Source

]]>
19
Jiri Kraus <![CDATA[CUDA Pro Tip: Profiling MPI Applications]]> http://www.open-lab.net/blog/parallelforall/?p=3313 2022-08-21T23:37:06Z 2014-06-19T19:05:55Z

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it's necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done…
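One simple, hedged way to recover that PID-to-rank mapping yourself (not necessarily the post's own solution) is to have every rank print its PID and hostname at startup, then match those against the process IDs listed by the profiler.

// Each rank reports its PID and hostname; match these against the processes
// shown in the profiler to find the rank you care about.
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("MPI rank %d -> pid %d on %s\n", rank, (int)getpid(), host);
    MPI_Finalize();
    return 0;
}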

Source

]]>
1
Jiri Kraus <![CDATA[Accelerating a C++ CFD Code with OpenACC]]> http://www.open-lab.net/blog/parallelforall/?p=2741 2022-08-21T23:37:03Z 2014-06-03T13:51:44Z

Computational Fluid Dynamics (CFD) is a valuable tool to study the behavior of fluids. Today, many areas of engineering use CFD. For example, the automotive industry uses CFD to study airflow around cars, and to optimize the car body shapes to reduce drag and improve fuel efficiency. To get accurate results in fluid simulation it is necessary to capture complex phenomena such as turbulence…
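For readers unfamiliar with OpenACC, a hedged sketch of the kind of directive involved; the array names and the Jacobi-style stencil are illustrative, not taken from the post's CFD code, and an OpenACC-capable compiler (for example, nvc++ -acc) is assumed.

// Offload a simple stencil update with an OpenACC parallel loop.
void jacobi_step(const float* a, float* b, int nx, int ny)
{
    #pragma acc parallel loop collapse(2) copyin(a[0:nx*ny]) copy(b[0:nx*ny])
    for (int j = 1; j < ny - 1; ++j)
        for (int i = 1; i < nx - 1; ++i)
            b[j * nx + i] = 0.25f * (a[j * nx + i - 1] + a[j * nx + i + 1] +
                                     a[(j - 1) * nx + i] + a[(j + 1) * nx + i]);
}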

Source

]]>
0
Satish Salian <![CDATA[NVIDIA Nsight Eclipse Edition for Jetson TK1]]> http://www.open-lab.net/blog/parallelforall/?p=3255 2024-08-28T18:00:33Z 2014-05-27T17:50:42Z

NVIDIA Nsight Eclipse Edition is a full-featured, integrated development environment that lets you easily develop CUDA applications for either your local (x86) system or a remote (x86 or Arm) target. In this post, I will walk you through the process of remote-developing CUDA applications for the NVIDIA Jetson TK1, an Arm-based development kit. Nsight supports two remote development modes: cross…

Source

]]>
103
Cliff Woolley <![CDATA[CUDA Pro Tip: Improve NVIDIA Visual Profiler Loading of Large Profiles]]> http://www.open-lab.net/blog/parallelforall/?p=3213 2024-12-10T17:13:44Z 2014-05-06T21:03:51Z

Post updated on December 10, 2024. NVIDIA has deprecated nvprof and NVIDIA Visual Profiler and these tools are not supported on current GPU architectures. The original post still applies to previous GPU architectures, up to and including Volta. For Volta and newer architectures, profile your applications with NVIDIA Nsight Compute and NVIDIA Nsight Systems. For more information about how to…

Source

]]>
4
Mark Ebersole http://www.open-lab.net/blog/parallelforall <![CDATA[CUDACasts Episode 13: Clock, Power, and Thermal Profiling with Nsight Eclipse Edition]]> http://www.open-lab.net/blog/parallelforall/?p=2358 2024-08-28T18:00:11Z 2013-12-19T16:13:19Z

In the world of high-performance computing, it is important to understand how your code affects the operating characteristics of your HW. For example, if your program executes inefficient code, it may cause the GPU to work harder than it needs to, leading to higher power consumption, and a potential slow-down due to throttling. A new profiling feature in CUDA 5.5 allows you to profile the…
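The feature itself lives in the Nsight Eclipse Edition profiler; as a programmatic stand-in (a different but related technique, not what the episode demonstrates), the same operating characteristics can be polled through NVML. A hedged sketch with error handling omitted:

// Poll power draw, temperature, and SM clock through NVML (link with -lnvidia-ml).
#include <nvml.h>
#include <cstdio>

int main()
{
    nvmlDevice_t dev;
    unsigned int power_mw = 0, temp_c = 0, sm_clock_mhz = 0;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetPowerUsage(dev, &power_mw);                      // milliwatts
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c); // degrees C
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_clock_mhz);    // MHz

    printf("power %.1f W, temp %u C, SM clock %u MHz\n",
           power_mw / 1000.0, temp_c, sm_clock_mhz);
    nvmlShutdown();
    return 0;
}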

Source

]]>
1
Jiri Kraus <![CDATA[CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX]]> http://www.open-lab.net/blog/parallelforall/?p=2003 2024-08-12T15:49:35Z 2013-09-04T01:49:42Z

The last time you used the timeline feature in the NVIDIA Visual Profiler, Nsight VSE or the new Nsight Systems to analyze a complex application, you might have wished to see a bit more than just CUDA API calls and GPU kernels. In this post I will show you how you can use the NVIDIA Tools Extension (NVTX) to annotate the time line with useful information. I will demonstrate how to add time…
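The basic pattern looks roughly like this; a minimal sketch in which the phase names and functions are placeholders, and the application links with -lnvToolsExt.

// Push/pop named NVTX ranges around phases of the application so they show up
// as annotated spans on the profiler timeline.
#include <nvToolsExt.h>

void init_data()  { /* ... */ }
void run_solver() { /* ... */ }

int main()
{
    nvtxRangePushA("init");     // range labeled "init" opens here
    init_data();
    nvtxRangePop();             // and closes here

    nvtxRangePushA("solve");
    run_solver();
    nvtxRangePop();
    return 0;
}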

Source

]]>
6
Wolfgang Hoenig <![CDATA[CUDA Pro Tip: View Assembly Code Correlation in Nsight Visual Studio Edition]]> http://www.parallelforall.com/?p=1581 2022-08-21T23:36:54Z 2013-06-24T04:53:59Z

While high-level languages for GPU programming like CUDA C offer a useful level of abstraction, convenience, and maintainability, they inherently hide some of the details of the execution on the hardware. It is sometimes helpful to dig into the underlying assembly code that the hardware is executing to explore performance problems, or to make sure the compiler is generating the code you expect.

Source

]]>
0
Mark Harris <![CDATA[CUDA Pro Tip: Clean Up After Yourself to Ensure Correct Profiling]]> http://www.parallelforall.com/?p=1506 2022-08-21T23:36:54Z 2013-05-28T21:07:05Z

NVIDIA's profiling and tracing tools, including the NVIDIA Visual Profiler, Nsight Eclipse and Visual Studio editions, cuda-memcheck, and the nvprof command line profiler are powerful tools that can give you deep insight into the performance and correctness of your GPU-accelerated applications. These tools gather data while your application is running, and use it to create profiles…
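A minimal sketch of the kind of clean-up involved, assuming the recommendation amounts to explicitly resetting the device before the process exits so buffered profile data is flushed.

// Tear down the CUDA context before exit; profilers that buffer trace data
// may otherwise report incomplete or missing profiles.
#include <cuda_runtime.h>

__global__ void work() { }

int main()
{
    work<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaDeviceReset();   // flush profile data and release the device
    return 0;
}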

Source

]]>
0
Mark Harris <![CDATA[How to Optimize Data Transfers in CUDA C/C++]]> http://www.parallelforall.com/?p=805 2022-08-21T23:36:49Z 2012-12-05T01:20:31Z

In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
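A hedged, self-contained sketch of the kind of comparison this series makes: timing the same host-to-device copy from pageable and from pinned (page-locked) host memory using CUDA events. Sizes are arbitrary and error checking is omitted.

// Compare H2D bandwidth from pageable vs. pinned host buffers.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t bytes = 64 << 20;                 // 64 MiB
    float *pageable = (float*)malloc(bytes);
    float *pinned, *d;
    cudaMallocHost((void**)&pinned, bytes);        // page-locked host buffer
    cudaMalloc((void**)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const void* src[2]  = { pageable, pinned };
    const char* name[2] = { "pageable", "pinned" };
    for (int k = 0; k < 2; ++k) {
        cudaEventRecord(start);
        cudaMemcpy(d, src[k], bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%-8s H2D: %.2f GB/s\n", name[k], bytes / ms / 1e6);
    }

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d); cudaFreeHost(pinned); free(pageable);
    return 0;
}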

Source

]]>
12
Greg Ruetsch <![CDATA[How to Optimize Data Transfers in CUDA Fortran]]> http://test.markmark.net/?p=434 2022-08-21T23:36:47Z 2012-11-29T18:08:36Z

In the previous three posts of this CUDA Fortran series we laid the groundwork for the major thrust of the series: how to optimize CUDA Fortran code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…

Source

]]>
2
Mark Harris <![CDATA[How to Implement Performance Metrics in CUDA C/C++]]> http://test.markmark.net/?p=390 2023-05-22T22:52:22Z 2012-11-08T04:03:28Z

In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important. CUDA performance measurement is��
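A sketch of the CUDA event timing pattern that discussion builds on, applied to a SAXPY launch; array contents are left uninitialized for brevity, and the effective-bandwidth figure assumes two reads and one write per element.

// Bracket a kernel launch with CUDA events and read back elapsed GPU time.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void**)&x, n * sizeof(float));
    cudaMalloc((void**)&y, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                     // enqueue start marker
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaEventRecord(stop);                      // enqueue stop marker
    cudaEventSynchronize(stop);                 // wait until the kernel is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("saxpy: %.3f ms, %.2f GB/s effective\n",
           ms, 3.0 * n * sizeof(float) / ms / 1e6);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(x); cudaFree(y);
    return 0;
}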

Source

]]>
20
Greg Ruetsch <![CDATA[How to Implement Performance Metrics in CUDA Fortran]]> http://test.markmark.net/?p=288 2022-08-21T23:36:47Z 2012-11-05T18:41:03Z

In the first post of this series we looked at the basic elements of CUDA Fortran by examining a CUDA Fortran implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA Fortran codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.

Source

]]>
4