Pro Tip – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-03-21T20:30:26Z http://www.open-lab.net/blog/feed/ Thejaswi Rao <![CDATA[CUDA Pro Tip: The Fast Way to Query Device Properties]]> http://www.open-lab.net/blog/?p=15512 2023-05-22T22:00:37Z 2019-08-20T16:10:48Z CUDA applications often need to know the maximum available shared memory per block or to query the number of multiprocessors in the active GPU. One way to do...]]>

CUDA applications often need to know the maximum available shared memory per block or to query the number of multiprocessors in the active GPU. One way to do this is by calling cudaGetDeviceProperties(). Unfortunately, calling this function inside a performance-critical section of your code can lead to huge slowdowns, depending on your code. We found out the hard way when cudaGetDeviceProperties() caused a 20x slowdown in the Random Forests algorithm…
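The usual remedies are to switch to the lightweight cudaDeviceGetAttribute() call, or to perform the heavyweight query once and reuse the result. A minimal host-side sketch of the caching pattern follows; the query function, its return value, and the counter are stand-ins for illustration, not the real CUDA API:

```cpp
#include <map>

// Stand-in for an expensive per-device query (in real CUDA code this would
// be cudaGetDeviceProperties, which can cost milliseconds per call).
static int g_query_count = 0;
int expensive_query(int device) {
    ++g_query_count;    // counts how often the slow path actually runs
    return 48 * 1024;   // pretend result: max shared memory per block
}

// Query each device at most once and memoize the result.
int cached_query(int device) {
    static std::map<int, int> cache;
    auto it = cache.find(device);
    if (it == cache.end())
        it = cache.emplace(device, expensive_query(device)).first;
    return it->second;
}
```

With this pattern, repeated calls in a hot loop hit the cache instead of the driver.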

Source

]]>
Christoph Kubisch <![CDATA[Pro Tip: Improved GLSL Syntax for Vulkan DescriptorSet Indexing]]> http://www.open-lab.net/blog/?p=14413 2023-02-13T17:46:10Z 2019-04-29T13:00:01Z Sometimes the evolution of programming languages creates situations where "simple" tasks take a bit more complexity to express. Syntax annoyance slows down...]]>

Sometimes the evolution of programming languages creates situations where "simple" tasks take a bit more complexity to express. Syntax annoyance slows down development or can negatively affect readability of code during maintenance. With this in mind, we recently released an open-source sample of a GLSL header generator for DescriptorSet-indexed types in Vulkan. For example, look at ray tracing…

Source

]]>
Greg Ruetsch <![CDATA[Pro Tip: Pinpointing Runtime Errors in CUDA Fortran]]> http://www.open-lab.net/blog/parallelforall/?p=8590 2022-08-21T23:38:33Z 2017-11-17T02:03:48Z CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.]]>

We've all been there. Your CUDA Fortran code is humming along and suddenly you get a runtime error, usually accompanied by an error name in all caps. In many cases, the error message gives you enough information to find where the problem is in your source code: you have a runtime error and you only perform a few host-to-device transfers, or your code ran fine before you added that block of code earlier…

Source

]]>
Tom Fogal <![CDATA[Pro Tip: Linking OpenGL for Server-Side Rendering]]> http://www.open-lab.net/blog/parallelforall/?p=8276 2022-08-21T23:38:25Z 2017-08-16T20:22:32Z Visualization is a great tool for understanding large amounts of data, but transferring the data from an HPC system or from the cloud to a local workstation for...]]> Figure 1: Server-side analysis and visualization of thermal operating bounds in vehicle design, using Intelligent Light's FieldView.

Visualization is a great tool for understanding large amounts of data, but transferring the data from an HPC system or from the cloud to a local workstation for analysis can be a painful experience. It's increasingly popular to avoid the transfer by analyzing and visualizing data in situ: right where it is generated. Moreover, using server-side rendering lets you deliver high quality visual…

Source

]]>
Cris Cecka <![CDATA[Pro Tip: cuBLAS Strided Batched Matrix Multiply]]> http://www.open-lab.net/blog/parallelforall/?p=7561 2022-08-21T23:38:07Z 2017-02-28T03:39:17Z There's a new computational workhorse in town. For decades, general matrix-matrix multiply, known as GEMM in Basic Linear Algebra Subroutines (BLAS)...]]>

There's a new computational workhorse in town. For decades, general matrix-matrix multiply, known as GEMM in Basic Linear Algebra Subroutines (BLAS) libraries, has been a standard benchmark for computational performance. GEMM is possibly the most optimized and widely used routine in scientific computing. Expert implementations are available for every architecture and quickly achieve the peak…

Source

]]>
Massimiliano Fatica <![CDATA[Customize CUDA Fortran Profiling with NVTX]]> http://www.open-lab.net/blog/parallelforall/?p=5951 2022-08-21T23:37:38Z 2015-09-30T01:53:40Z The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the...]]>

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the NVIDIA Visual Profiler (NVVP) and NSight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Source

]]>
Elmar Westphal <![CDATA[Voting and Shuffling to Optimize Atomic Operations]]> http://www.open-lab.net/blog/parallelforall/?p=5700 2022-08-21T23:37:36Z 2015-08-06T07:24:19Z Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to...]]>

Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs)…

Source

]]>
Mark Harris <![CDATA[GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++]]> http://www.open-lab.net/blog/parallelforall/?p=5479 2022-08-21T23:37:33Z 2015-06-30T02:26:41Z This post demonstrates the practical utility of CUDA's sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of...]]>

This post demonstrates the practical utility of CUDA's sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of location-aware and geospatial applications and geographical information systems (GIS), these distance computations have become commonplace. Wikipedia defines a great circle as the intersection of a sphere and a plane that passes through the center of the sphere. For almost any pair of points on the surface of a sphere…
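The benefit of sinpi()/cospi() comes from inputs in degrees: dividing by 180 lets the factor of pi be absorbed exactly. Standard C++ has no sinpi(), so this host-side sketch emulates it with sin(M_PI*x); the function name and the mean Earth radius constant are illustrative assumptions, but the structure mirrors a haversine great-circle distance written around sinpi/cospi:

```cpp
#include <cmath>

// Host stand-ins for CUDA's sinpi(x)/cospi(x), which compute sin(pi*x)
// and cos(pi*x) in one step and with better accuracy than sin(M_PI*x).
double sinpi_(double x) { return std::sin(M_PI * x); }
double cospi_(double x) { return std::cos(M_PI * x); }

// Haversine great-circle distance; latitudes/longitudes in degrees.
// Dividing by 180 first lets sinpi/cospi absorb the degrees-to-radians
// factor of pi without an explicit multiplication by an inexact M_PI.
double great_circle_km(double lat1, double lon1, double lat2, double lon2) {
    const double R = 6371.0;  // mean Earth radius in km (assumed value)
    double dlat = (lat2 - lat1) / 180.0;
    double dlon = (lon2 - lon1) / 180.0;
    double a = sinpi_(dlat / 2) * sinpi_(dlat / 2) +
               cospi_(lat1 / 180.0) * cospi_(lat2 / 180.0) *
               sinpi_(dlon / 2) * sinpi_(dlon / 2);
    return 2.0 * R * std::asin(std::sqrt(a));
}
```

In device code you would simply call sinpi()/cospi() directly and get both the speed and the accuracy benefit.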

Source

]]>
Mark Harris <![CDATA[GPU Pro Tip: Lerp Faster in C++]]> http://www.open-lab.net/blog/parallelforall/?p=5412 2022-08-21T23:37:33Z 2015-06-11T06:14:05Z Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It's so common in computer graphics that programmers often use...]]>

Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It's so common in computer graphics that programmers often use the verb "lerp" to refer to linear interpolation, a function that's built into all modern graphics hardware (often in multiple hardware units). You can enable linear interpolation (also known as linear filtering) on texture fetches in…
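The faster software formulation the title alludes to can be sketched with two fused multiply-adds; this is a sketch of the idea rather than a quote of the post's code, but on GPUs each fmaf maps to a single FFMA instruction:

```cpp
#include <cmath>

// Naive lerp: subtract, multiply, add. It can lose precision and does not
// guarantee that lerp_naive(a, b, 1.0f) returns exactly b.
float lerp_naive(float a, float b, float t) { return a + t * (b - a); }

// Two fused multiply-adds computing t*b + (a - t*a). Each fmaf rounds
// only once, so this is both faster and more accurate on hardware with
// FMA units, and it hits the endpoints exactly at t = 0 and t = 1.
float lerp_fma(float a, float b, float t) {
    return fmaf(t, b, fmaf(-t, a, a));
}
```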

Source

]]>
Nikolay Sakharnykh <![CDATA[GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell]]> http://www.open-lab.net/blog/parallelforall/?p=4175 2022-08-21T23:37:29Z 2015-03-17T16:34:16Z Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical...]]>

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let's focus on one of the most common use cases: an image histogram using 256 bins for each…
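The shared-memory-atomics technique the post describes can be sketched as a kernel along these lines (a sketch assuming a 256-bin histogram of 8-bit pixels; launch configuration and error handling omitted):

```cuda
// Each block accumulates a privatized histogram in shared memory, then
// merges it into the global histogram with one global atomic per bin per
// block. Shared-memory atomics are fast on Maxwell and later GPUs.
__global__ void histogram256(const unsigned char *in, int n, unsigned int *out)
{
    __shared__ unsigned int smem[256];

    // Zero the block-private bins.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        smem[i] = 0;
    __syncthreads();

    // Grid-stride over the input, counting into shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&smem[in[i]], 1);
    __syncthreads();

    // Merge the block-private bins into the global histogram.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&out[i], smem[i]);
}
```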

Source

]]>
Maxim Milakov <![CDATA[GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA]]> http://www.open-lab.net/blog/parallelforall/?p=4893 2022-08-21T23:37:30Z 2015-02-11T09:16:03Z Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of...]]>

Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of factors. In this post I'll cover several common scenarios ranging from fast static indexing to more complex and challenging use cases. Before discussing dynamic indexing let's briefly look at static indexing. For small arrays where all…

Source

]]>
Mark Harris <![CDATA[GPU Pro Tip: CUDA 7 Streams Simplify Concurrency]]> http://www.open-lab.net/blog/parallelforall/?p=4286 2022-08-21T23:37:29Z 2015-01-23T03:46:33Z Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions...]]>

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with…

Source

]]>
Andy Adinets <![CDATA[CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics]]> http://www.open-lab.net/blog/parallelforall/?p=3906 2022-08-21T23:37:27Z 2014-10-02T05:57:09Z Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in...]]>

Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in many cases, so you can get higher performance with no extra effort. In fact, the code generated by the compiler is actually faster than the manually-written warp aggregation code. This post is mainly intended for those who want to learn how…

Source

]]>
Christoph Angerer <![CDATA[CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing]]> http://www.open-lab.net/blog/parallelforall/?p=3736 2022-08-21T23:37:09Z 2014-09-24T12:55:51Z Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the...]]>

Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the input data is supplied as low-resolution samples from an 8-bit analog-to-digital (A/D) converter, the samples may first have to be expanded into 32-bit floating point numbers before the FFT and the rest of the processing pipeline can start.

Source

]]>
Justin Luitjens <![CDATA[CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs]]> http://www.open-lab.net/blog/parallelforall/?p=3619 2022-08-21T23:37:08Z 2014-09-05T00:07:17Z We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don't mean just parallelism...]]>

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don't mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It's common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I'll cover a common but subtle bug and a simple rule…

Source

]]>
Jeremy Appleyard <![CDATA[CUDA Pro Tip: Optimize for Pointer Aliasing]]> http://www.open-lab.net/blog/parallelforall/?p=3431 2022-08-21T23:37:07Z 2014-08-08T01:29:25Z Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering...]]>

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance. Two pointers alias if the memory to which they point…
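The standard alteration is to promise the compiler that pointers do not overlap. A sketch in C99 (in CUDA C++ the equivalent keyword is __restrict__): with restrict, the compiler can keep loaded values in registers instead of reloading after every store.

```c
/* Without restrict, the compiler must assume `out` may alias `a` or `b`
 * and reload operands after every store to out[i]. With restrict, the
 * programmer promises the pointers never overlap, enabling reordering,
 * vectorization, and register reuse. */
void saxpy(int n, float alpha,
           const float *restrict a,
           const float *restrict b,
           float *restrict out)
{
    for (int i = 0; i < n; ++i)
        out[i] = alpha * a[i] + b[i];
}
```

Passing overlapping arrays to a restrict-qualified function is undefined behavior, so the promise has to be kept by the caller.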

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Occupancy API Simplifies Launch Configuration]]> http://www.open-lab.net/blog/parallelforall/?p=3366 2022-08-21T23:37:06Z 2014-07-18T04:43:39Z CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and...]]>

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor…

Source

]]>
Jiri Kraus <![CDATA[CUDA Pro Tip: Profiling MPI Applications]]> http://www.open-lab.net/blog/parallelforall/?p=3313 2022-08-21T23:37:06Z 2014-06-19T19:05:55Z When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it's necessary to identify the MPI rank where...]]>

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it's necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done…

Source

]]>
Julien Demouth <![CDATA[CUDA Pro Tip: Minimize the Tail Effect]]> http://www.open-lab.net/blog/parallelforall/?p=3275 2022-08-21T23:37:05Z 2014-06-04T14:17:42Z When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the...]]>

When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated from the size of the blocks and the amount of…

Source

]]>
Cliff Woolley <![CDATA[CUDA Pro Tip: Improve NVIDIA Visual Profiler Loading of Large Profiles]]> http://www.open-lab.net/blog/parallelforall/?p=3213 2024-12-10T17:13:44Z 2014-05-06T21:03:51Z Post updated on December 10, 2024. NVIDIA has deprecated nvprof and NVIDIA Visual Profiler and these tools are not supported on current GPU architectures. The...]]>

Post updated on December 10, 2024. NVIDIA has deprecated nvprof and NVIDIA Visual Profiler and these tools are not supported on current GPU architectures. The original post still applies to previous GPU architectures, up to and including Volta. For Volta and newer architectures, profile your applications with NVIDIA Nsight Compute and NVIDIA Nsight Systems. For more information about how to…

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Fast and Robust Computation of Givens Rotations]]> http://www.open-lab.net/blog/parallelforall/?p=3140 2022-08-21T23:37:04Z 2014-04-29T17:59:10Z A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form $latex G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 &...]]>

A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form $G(i, j, \theta)$, where the intersections of the $i$th and $j$th rows and columns contain the values $c = \cos\theta$ and $s = \sin\theta$. Multiplying a vector by a Givens rotation matrix represents a rotation of the vector in the plane by $\theta$ radians. According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in…
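For reference, a host-side sketch of the robust way to compute the rotation coefficients that zero out a vector component; using hypot avoids overflow and underflow in $a^2 + b^2$ for extreme inputs (the post itself is about doing this quickly on the GPU, so treat this as the baseline formulation rather than its final kernel):

```cpp
#include <cmath>

// Compute c = cos(theta), s = sin(theta), and r such that
//   [  c  s ] [a]   [r]
//   [ -s  c ] [b] = [0]
// std::hypot computes sqrt(a*a + b*b) without intermediate overflow.
void givens(double a, double b, double &c, double &s, double &r) {
    if (b == 0.0) { c = 1.0; s = 0.0; r = a; return; }
    r = std::hypot(a, b);
    c = a / r;
    s = b / r;
}
```

Applying the 2x2 rotation to (a, b) gives (r, 0) by construction: the second component is (-s*a + c*b) = (-ab + ab)/r = 0.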

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Increase Application Performance with NVIDIA GPU Boost]]> http://www.open-lab.net/blog/parallelforall/?p=3090 2022-08-21T23:37:03Z 2014-03-20T05:56:19Z NVIDIA GPU Boost™ is a feature available on NVIDIA GeForce and NVIDIA Tesla products. It makes use of any power headroom to boost...]]>

NVIDIA GPU Boost is a feature available on NVIDIA GeForce products and NVIDIA Tesla products. It makes use of any power headroom to boost application performance. In the case of Tesla, the NVIDIA GPU Boost feature is customized for compute intensive workloads running on clusters. This application note is useful for anyone who wants to take advantage of the power headroom on the Tesla K40 in a…

Source

]]>
Greg Ruetsch <![CDATA[CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran]]> http://www.open-lab.net/blog/parallelforall/?p=2672 2022-08-21T23:37:03Z 2014-03-06T04:41:20Z CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...]]>

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM).

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Do The Kepler Shuffle]]> http://www.open-lab.net/blog/parallelforall/?p=2626 2022-08-21T23:37:03Z 2014-02-03T11:00:00Z When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use...]]>

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each other's registers by using a new instruction called SHFL…
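A classic use of the shuffle instruction is a warp-level sum that needs no shared memory at all. A sketch, shown with the CUDA 9+ `_sync` form (the original Kepler-era intrinsic was `__shfl_down` without the mask argument):

```cuda
// Warp-level sum via register shuffles. Each step halves the number of
// lanes contributing; after log2(32) = 5 steps, lane 0 holds the sum of
// all 32 lanes' values, with no shared memory or __syncthreads needed.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```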

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES]]> http://www.open-lab.net/blog/parallelforall/?p=2503 2022-08-21T23:37:02Z 2014-01-28T01:47:24Z As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason writes:...]]>

As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason shows how the CUDA_VISIBLE_DEVICES environment variable controls which GPUs a CUDA application can see. As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide.
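Usage looks like this (the device indices are examples):

```shell
# Only physical GPUs 2 and 0 are visible to subsequently launched CUDA
# apps; inside the process they are renumbered as devices 0 and 1, in the
# order listed.
export CUDA_VISIBLE_DEVICES="2,0"
echo "$CUDA_VISIBLE_DEVICES"

# An empty value hides all GPUs from a single command, e.g.:
# CUDA_VISIBLE_DEVICES="" ./my_app   # my_app is a placeholder name
```

This is handy for quick experiments and job schedulers, but as noted above it is no substitute for proper device enumeration in the application itself.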

Source

]]>
Justin Luitjens <![CDATA[CUDA Pro Tip: Increase Performance with Vectorized Memory Access]]> http://www.open-lab.net/blog/parallelforall/?p=2287 2022-08-21T23:36:58Z 2013-12-04T18:37:25Z Many CUDA kernels are bandwidth bound, and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth bound kernels. This makes it...]]>

Source

]]>
Jiri Kraus <![CDATA[CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX]]> http://www.open-lab.net/blog/parallelforall/?p=2003 2024-08-12T15:49:35Z 2013-09-04T01:49:42Z The last time you used the timeline feature in the NVIDIA Visual Profiler, Nsight VSE or the new Nsight Systems to analyze a complex application, you might have...]]>

The last time you used the timeline feature in the NVIDIA Visual Profiler, Nsight VSE or the new Nsight Systems to analyze a complex application, you might have wished to see a bit more than just CUDA API calls and GPU kernels. In this post I will show you how you can use the NVIDIA Tools Extension (NVTX) to annotate the timeline with useful information. I will demonstrate how to add time…
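A common way to keep such annotations optional is to wrap the NVTX C API in macros. In this sketch, nvtxRangePushA/nvtxRangePop are the real NVTX entry points, while the USE_NVTX guard, macro names, and the annotated function are illustrative:

```c
/* When built with -DUSE_NVTX and linked against -lnvToolsExt, the macros
 * forward to the real NVTX calls and named ranges appear on the profiler
 * timeline; otherwise they compile away to no-ops, so the annotated code
 * still builds and runs everywhere. */
#ifdef USE_NVTX
#include <nvToolsExt.h>
#define PUSH_RANGE(name) nvtxRangePushA(name)
#define POP_RANGE()      nvtxRangePop()
#else
#define PUSH_RANGE(name) ((void)0)
#define POP_RANGE()      ((void)0)
#endif

/* Hypothetical workload: the range brackets the region of interest. */
double step(double x) {
    PUSH_RANGE("step");
    double y = x * 2.0;
    POP_RANGE();
    return y;
}
```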

Source

]]>
Wolfgang Hoenig <![CDATA[CUDA Pro Tip: View Assembly Code Correlation in Nsight Visual Studio Edition]]> http://www.parallelforall.com/?p=1581 2022-08-21T23:36:54Z 2013-06-24T04:53:59Z While high-level languages for GPU programming like CUDA C offer a useful level of abstraction, convenience, and maintainability, they inherently hide some of...]]>

While high-level languages for GPU programming like CUDA C offer a useful level of abstraction, convenience, and maintainability, they inherently hide some of the details of the execution on the hardware. It is sometimes helpful to dig into the underlying assembly code that the hardware is executing to explore performance problems, or to make sure the compiler is generating the code you expect.

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Understand Fat Binaries and JIT Caching]]> http://www.parallelforall.com/?p=1531 2022-08-21T23:36:54Z 2013-06-05T00:41:31Z As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of...]]>

As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of GPUs, the NVIDIA compiler tool chain supports compiling for multiple architectures in the same application executable or library. CUDA also relies on the PTX virtual GPU ISA to provide forward compatibility, so that already deployed…

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Clean Up After Yourself to Ensure Correct Profiling]]> http://www.parallelforall.com/?p=1506 2022-08-21T23:36:54Z 2013-05-28T21:07:05Z NVIDIA's profiling and tracing tools, including the NVIDIA Visual Profiler, NSight Eclipse and Visual Studio editions, cuda-memcheck, and the nvprof command...]]>

NVIDIA's profiling and tracing tools, including the NVIDIA Visual Profiler, NSight Eclipse and Visual Studio editions, cuda-memcheck, and the nvprof command line profiler are powerful tools that can give you deep insight into the performance and correctness of your GPU-accelerated applications. These tools gather data while your application is running, and use it to create profiles…

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops]]> http://www.parallelforall.com/?p=1443 2025-03-17T16:24:00Z 2013-04-23T06:59:24Z One of the most common tasks in CUDA programming is to parallelize a loop using a kernel. As an example, let's use our old friend SAXPY. Here's the basic...]]>
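The grid-stride idea the title refers to can be sketched as a SAXPY kernel like this (a sketch; the original post builds it up step by step):

```cuda
// Grid-stride loop version of SAXPY: instead of assuming one thread per
// element, each thread strides over the array by the total grid size.
// Any launch configuration handles any n, and the same kernel can be
// debugged serially with a <<<1, 1>>> launch.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}
```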

Source

]]>
M Clark <![CDATA[CUDA Pro Tip: Kepler Texture Objects Improve Performance and Flexibility]]> http://www.parallelforall.com/?p=969 2022-08-21T23:36:50Z 2013-02-04T04:58:39Z The Kepler architecture introduces texture objects, a new feature that makes textures easier to use and higher performance. Textures are...]]>

The Kepler architecture introduces texture objects, a new feature that makes textures easier to use and higher performance. Textures are likely a familiar concept to anyone who's done much CUDA programming. A feature from the graphics world, textures are images that are stretched, rotated and pasted on polygons to form the 3D graphics we are familiar with. Using textures for GPU computing has…

Source

]]>
Mark Harris <![CDATA[CUDA Pro Tip: Flush Denormals with Confidence]]> http://www.parallelforall.com/?p=938 2022-08-21T23:36:49Z 2013-01-10T22:02:07Z I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point...]]>

I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point representation, I recommend starting with the Wikipedia entry on floating point, and for more detail about NVIDIA GPU floating point, check out this excellent white paper. The Wikipedia entry on denormal numbers is a good start for this post…

Source

]]>