Mark Harris – NVIDIA Technical Blog News and tutorials for developers, data scientists, and IT admins 2025-03-17T16:24:00Z http://www.open-lab.net/blog/feed/ Mark Harris <![CDATA[Implementing High-Precision Decimal Arithmetic with CUDA int128]]> http://www.open-lab.net/blog/?p=43367 2022-08-21T23:53:19Z 2022-02-10T16:00:00Z “Truth is much too complicated to allow anything but approximations.” -- John von Neumann The history of computing has demonstrated that there is no limit...]]>

“Truth is much too complicated to allow anything but approximations.” — John von Neumann The history of computing has demonstrated that there is no limit to what can be achieved with the relatively simple arithmetic implemented in computer hardware. But the “truth” that computers represent using finite-size numbers is fundamentally approximate. As David Goldberg wrote…

Source

]]>
0
Mark Harris <![CDATA[Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager]]> http://www.open-lab.net/blog/?p=22554 2022-08-21T23:40:48Z 2020-12-08T19:27:00Z When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high...]]>

When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrame objects rather than modifying them in place. The overhead of cudaMalloc and the synchronization of cudaFree was holding RAPIDS back. My first task for RAPIDS was to help with this problem, so I created a rough…

Source

]]>
9
Mark Harris <![CDATA[CUDA Pro Tip: The Fast Way to Query Device Properties]]> http://www.open-lab.net/blog/?p=15512 2023-05-22T22:00:37Z 2019-08-20T16:10:48Z CUDA applications often need to know the maximum available shared memory per block or to query the number of multiprocessors in the active GPU. One way to do...]]>

CUDA applications often need to know the maximum available shared memory per block or to query the number of multiprocessors in the active GPU. One way to do this is by calling cudaGetDeviceProperties(). Unfortunately, calling this function inside a performance-critical section of your code can lead to huge slowdowns. We found out the hard way when it caused a 20x slowdown in the Random Forests algorithm…

Source

]]>
6
Mark Harris <![CDATA[RAPIDS Accelerates Data Science End-to-End]]> http://www.open-lab.net/blog/?p=12361 2022-08-21T23:39:09Z 2018-10-15T21:24:31Z Today's data science problems demand a dramatic increase in the scale of data as well as the computational power required to process it. Unfortunately, the...]]>

Source

]]>
12
Mark Harris <![CDATA[Cooperative Groups: Flexible CUDA Thread Programming]]> http://www.open-lab.net/blog/parallelforall/?p=8415 2023-06-12T21:16:47Z 2017-10-05T04:17:43Z In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The...]]>

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to algorithm, so thread synchronization should be flexible. Making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA 9 introduces Cooperative Groups…

Source

]]>
32
Mark Harris <![CDATA[Unified Memory for CUDA Beginners]]> http://www.open-lab.net/blog/parallelforall/?p=7937 2022-08-21T23:38:11Z 2017-06-20T03:59:57Z My previous introductory post, "An Even Easier Introduction to CUDA C++", introduced the basics of CUDA programming by showing how to write a simple program...]]>

Source

]]>
46
Mark Harris <![CDATA[CUDA 9 Features Revealed: Volta, Cooperative Groups and More]]> http://www.open-lab.net/blog/parallelforall/?p=7874 2023-02-13T18:15:06Z 2017-05-11T07:18:30Z Figure 1: CUDA 9 provides a preview API for programming Tesla V100 Tensor Cores, providing a huge...]]>

At the 2017 GPU Technology Conference NVIDIA announced CUDA 9, the latest version of CUDA’s powerful parallel computing platform and programming model. The CUDA Toolkit version 9.0 is now available as a free download. In this post I’ll provide an overview of the awesome new features of CUDA 9. To learn more you can watch the recording of my talk from GTC…

Source

]]>
45
Mark Harris <![CDATA[NVIDIA DGX-1: The Fastest Deep Learning System]]> http://www.open-lab.net/blog/parallelforall/?p=7684 2022-08-21T23:38:08Z 2017-04-05T15:00:55Z Figure 1: NVIDIA DGX-1. One year ago today, NVIDIA announced the NVIDIA® DGX-1™,...]]>

One year ago today, NVIDIA announced the NVIDIA® DGX-1, an integrated system for deep learning. DGX-1 (shown in Figure 1) features eight Tesla P100 GPU accelerators connected through NVLink, the NVIDIA high-performance GPU interconnect, in a hybrid cube-mesh network. Together with dual socket Intel Xeon CPUs and four 100 Gb InfiniBand network interface cards, DGX-1 provides unprecedented…

Source

]]>
2
Mark Harris <![CDATA[An Even Easier Introduction to CUDA]]> http://www.open-lab.net/blog/parallelforall/?p=7501 2022-08-21T23:38:05Z 2017-01-25T12:31:14Z This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA. I wrote a previous post, Easy...]]>

Source

]]>
141
Mark Harris <![CDATA[Mixed-Precision Programming with CUDA 8]]> http://www.open-lab.net/blog/parallelforall/?p=7311 2022-08-21T23:38:00Z 2016-10-19T21:30:47Z Update, March 25, 2019: The latest Volta and Turing GPUs now incorporate Tensor Cores, which accelerate certain types of FP16 matrix math. This enables faster...]]>

Update, March 25, 2019: The latest Volta and Turing GPUs now incorporate Tensor Cores, which accelerate certain types of FP16 matrix math. This enables faster and easier mixed-precision computation within popular AI frameworks. Making use of Tensor Cores requires using CUDA 9 or later. NVIDIA has also added automatic mixed precision capabilities to TensorFlow, PyTorch, and MXNet.

Source

]]>
1
Mark Harris <![CDATA[New Pascal GPUs Accelerate Inference in the Data Center]]> http://www.open-lab.net/blog/parallelforall/?p=7156 2022-08-21T23:37:57Z 2016-09-13T03:01:32Z Artificial intelligence is already more ubiquitous than many people realize. Applications of AI abound, many of them powered by complex deep neural networks...]]>

Artificial intelligence is already more ubiquitous than many people realize. Applications of AI abound, many of them powered by complex deep neural networks trained on massive data using GPUs. These applications understand when you talk to them; they can answer questions; and they can help you find information in ways you couldn’t before. Pinterest image search technology allows users to find…

Source

]]>
3
Mark Harris <![CDATA[Train Your Reinforcement Learning Agents at the OpenAI Gym]]> http://www.open-lab.net/blog/parallelforall/?p=6628 2022-08-21T23:37:51Z 2016-04-27T17:00:51Z Today OpenAI, a non-profit artificial intelligence research company, launched OpenAI Gym, a toolkit for developing and comparing reinforcement...]]>

Today OpenAI, a non-profit artificial intelligence research company, launched OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go. OpenAI researcher John Schulman shared some details about his organization, and how OpenAI Gym will make it easier for AI researchers to design…

Source

]]>
4
Mark Harris <![CDATA[Inside Pascal: NVIDIA’s Newest Computing Platform]]> http://www.open-lab.net/blog/parallelforall/?p=6535 2022-08-21T23:37:50Z 2016-04-05T17:00:44Z At the 2016 GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang announced the new NVIDIA Tesla P100, the most advanced accelerator ever built....]]>

At the 2016 GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang announced the new NVIDIA Tesla P100, the most advanced accelerator ever built. Based on the new NVIDIA Pascal GP100 GPU and powered by ground-breaking technologies, Tesla P100 delivers the highest absolute performance for HPC, technical computing, deep learning, and many computationally intensive datacenter workloads.

Source

]]>
51
Mark Harris <![CDATA[CUDA 8 Features Revealed]]> http://www.open-lab.net/blog/parallelforall/?p=6554 2022-08-21T23:37:50Z 2016-04-05T12:00:11Z Today I'm excited to announce the general availability of CUDA 8, the latest update to NVIDIA's powerful parallel computing platform and programming model. In...]]>

Today I’m excited to announce the general availability of CUDA 8, the latest update to NVIDIA’s powerful parallel computing platform and programming model. In this post I’ll give a quick overview of the major new features of CUDA 8. To learn more you can watch the recording of my talk from GTC 2016, “CUDA 8 and Beyond”. A crucial goal for CUDA 8 is to provide support for the powerful new…

Source

]]>
51
Mark Harris <![CDATA[Accelerating Hyperscale Data Center Applications with NVIDIA M40 and M4 GPUs]]> http://www.open-lab.net/blog/parallelforall/?p=6092 2023-09-18T17:40:03Z 2015-11-10T14:02:16Z The internet has changed how people consume media. Rather than just watching television and movies, the combination of ubiquitous mobile devices, massive...]]>

The internet has changed how people consume media. Rather than just watching television and movies, the combination of ubiquitous mobile devices, massive computation, and available Internet bandwidth has led to an explosion in user-created content: users are re-creating the Internet, producing exabytes of content every day. Periscope, a mobile application that lets users broadcast video…

Source

]]>
1
Mark Harris <![CDATA[Performance Portability from GPUs to CPUs with OpenACC]]> http://www.open-lab.net/blog/parallelforall/?p=6043 2022-08-21T23:37:39Z 2015-10-29T22:52:27Z OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a...]]>

OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs…

Source

]]>
4
Mark Harris <![CDATA[Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5]]> http://www.open-lab.net/blog/parallelforall/?p=5917 2022-08-21T23:37:38Z 2015-09-21T11:44:48Z The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance...]]>

The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler: support for GPU Lambda expressions. Lambda expressions, introduced in C++11…

Source

]]>
3
Mark Harris <![CDATA[New Features in CUDA 7.5]]> http://www.open-lab.net/blog/parallelforall/?p=5529 2023-02-13T18:15:18Z 2015-07-08T07:01:34Z Today I'm happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger...]]>

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details. CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic…

Source

]]>
66
Mark Harris <![CDATA[GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++]]> http://www.open-lab.net/blog/parallelforall/?p=5479 2022-08-21T23:37:33Z 2015-06-30T02:26:41Z This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of...]]>

This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of location-aware and geospatial applications and geographical information systems (GIS), these distance computations have become commonplace. Wikipedia defines a great circle as the intersection of a sphere and a plane that passes through the center of the sphere. For almost any pair of points on the surface of a sphere…

Source

]]>
0
Mark Harris <![CDATA[GPU Pro Tip: Lerp Faster in C++]]> http://www.open-lab.net/blog/parallelforall/?p=5412 2022-08-21T23:37:33Z 2015-06-11T06:14:05Z Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It's so common in computer graphics that programmers often use...]]>

Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units). You can enable linear interpolation (also known as linear filtering) on texture fetches in…

Source

]]>
25
Mark Harris <![CDATA[C++11 in CUDA: Variadic Templates]]> http://www.open-lab.net/blog/parallelforall/?p=5011 2022-08-21T23:37:31Z 2015-03-27T05:16:56Z CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but...]]>

Source

]]>
6
Mark Harris <![CDATA[The Power of C++11 in CUDA 7]]> http://www.open-lab.net/blog/parallelforall/?p=4999 2022-08-21T23:37:31Z 2015-03-18T08:48:26Z Today I'm excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA...]]>

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone! CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself…

Source

]]>
7
Mark Harris <![CDATA[GPU Pro Tip: CUDA 7 Streams Simplify Concurrency]]> http://www.open-lab.net/blog/parallelforall/?p=4286 2022-08-21T23:37:29Z 2015-01-23T03:46:33Z Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions...]]>

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA Applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with…

Source

]]>
51
Mark Harris <![CDATA[CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More]]> http://www.open-lab.net/blog/parallelforall/?p=4187 2022-08-21T23:37:29Z 2015-01-13T18:00:47Z It's almost time for the next major release of the CUDA Toolkit, so I'm excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA...]]>

Source

]]>
43
Mark Harris <![CDATA[Porting GPU-Accelerated Applications to POWER8 Systems]]> http://www.open-lab.net/blog/parallelforall/?p=4127 2022-10-10T18:43:24Z 2014-12-02T03:20:20Z With the US Department of Energy's announcement of plans to base two future flagship supercomputers on IBM POWER CPUs, NVIDIA GPUs, NVIDIA NVLink interconnect,...]]>

With the US Department of Energy’s announcement of plans to base two future flagship supercomputers on IBM POWER CPUs, NVIDIA GPUs, NVIDIA NVLink interconnect, and Mellanox high-speed networking, many developers are getting started building GPU-accelerated applications that run on IBM POWER processors. The good news is that porting existing applications to this platform is easy. In fact…

Source

]]>
2
Mark Harris <![CDATA[How NVLink Will Enable Faster, Easier Multi-GPU Computing]]> http://www.open-lab.net/blog/parallelforall/?p=4058 2022-08-21T23:37:28Z 2014-11-14T15:05:15Z Accelerated systems have become the new standard for high performance computing (HPC) as GPUs continue to raise the bar for both performance and energy...]]>

Accelerated systems have become the new standard for high performance computing (HPC) as GPUs continue to raise the bar for both performance and energy efficiency. In 2012, Oak Ridge National Laboratory announced what was to become the world’s fastest supercomputer, Titan, equipped with one NVIDIA® GPU per CPU – over 18 thousand GPU accelerators. Titan established records not only in absolute…

Source

]]>
10
Mark Harris <![CDATA[12 Things You Should Know about the Tesla Accelerated Computing Platform]]> http://www.open-lab.net/blog/parallelforall/?p=4050 2023-02-13T18:12:43Z 2014-11-11T16:41:27Z You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel...]]>

You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s…

Source

]]>
5
Mark Harris <![CDATA[Maxwell: The Most Advanced CUDA GPU Ever Made]]> http://www.open-lab.net/blog/parallelforall/?p=3669 2022-08-21T23:37:08Z 2014-09-19T02:47:33Z Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization...]]>

Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization of the Maxwell architecture. The GeForce GTX 980 and 970 GPUs introduced today are the most advanced gaming and graphics GPUs ever made. But of course they also make fantastic CUDA development GPUs, with full support for CUDA 6.5…

Source

]]>
19
Mark Harris <![CDATA[10 Ways CUDA 6.5 Improves Performance and Productivity]]> http://www.open-lab.net/blog/parallelforall/?p=3436 2022-10-10T18:42:09Z 2014-08-20T13:00:38Z Today we're excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including...]]>

Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more. Last year we introduced CUDA on Arm, and in March we released the Jetson TK1 developer board…

Source

]]>
21
Mark Harris <![CDATA[Unified Memory: Now for CUDA Fortran Programmers]]> http://www.open-lab.net/blog/parallelforall/?p=3441 2022-08-21T23:37:07Z 2014-08-13T07:08:26Z Unified Memory is a CUDA feature that we've talked a lot about on Parallel Forall. CUDA 6 introduced Unified Memory, which dramatically simplifies GPU...]]>

Unified Memory is a CUDA feature that we’ve talked a lot about on Parallel Forall. CUDA 6 introduced Unified Memory, which dramatically simplifies GPU programming by giving programmers a single pointer to data which is accessible from either the GPU or the CPU. But this enhanced memory model has only been available to CUDA C/C++ programmers, until now. The new PGI Compiler release 14.7…

Source

]]>
2
Mark Harris <![CDATA[CUDA Pro Tip: Occupancy API Simplifies Launch Configuration]]> http://www.open-lab.net/blog/parallelforall/?p=3366 2022-08-21T23:37:06Z 2014-07-18T04:43:39Z CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and...]]>

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor…

Source

]]>
12
Mark Harris <![CDATA[CUDA Pro Tip: Fast and Robust Computation of Givens Rotations]]> http://www.open-lab.net/blog/parallelforall/?p=3140 2022-08-21T23:37:04Z 2014-04-29T17:59:10Z A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 &...]]>

A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form G(i, j, θ), where the intersections of the ith and jth rows and columns contain the values c = cos θ and s = sin θ. Multiplying a vector by a Givens rotation matrix represents a rotation of the vector in the (i, j) plane by θ radians. According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in…

Source

]]>
2
Mark Harris <![CDATA[Jetson TK1: Mobile Embedded Supercomputer Takes CUDA Everywhere]]> http://www.open-lab.net/blog/parallelforall/?p=3108 2022-08-21T23:37:04Z 2014-04-04T05:56:56Z Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more...]]>

Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements for these devices are increasing exponentially, while their size, weight, and power limits continue to decrease.

Source

]]>
54
Mark Harris <![CDATA[CUDA Pro Tip: Increase Application Performance with NVIDIA GPU Boost]]> http://www.open-lab.net/blog/parallelforall/?p=3090 2022-08-21T23:37:03Z 2014-03-20T05:56:19Z NVIDIA GPU Boost™ is a feature available on NVIDIA® GeForce® and NVIDIA® Tesla® products. It makes use of any power headroom to boost...]]>

NVIDIA GPU Boost is a feature available on NVIDIA® GeForce® products and NVIDIA® Tesla® products. It makes use of any power headroom to boost application performance. In the case of Tesla, the NVIDIA GPU Boost feature is customized for compute intensive workloads running on clusters. This application note is useful for anyone who wants to take advantage of the power headroom on the Tesla K40 in a…

Source

]]>
0
Mark Harris <![CDATA[5 Things You Should Know About the New Maxwell GPU Architecture]]> http://www.open-lab.net/blog/parallelforall/?p=2726 2022-10-10T18:40:14Z 2014-02-21T23:28:09Z [Be sure to check out Maxwell: The Most Advanced CUDA GPU Ever Made, a newer post about the second-generation Maxwell GPU architecture.] The introduction this...]]>

[Be sure to check out Maxwell: The Most Advanced CUDA GPU Ever Made, a newer post about the second-generation Maxwell GPU architecture.] The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such…

Source

]]>
10
Mark Harris <![CDATA[CUDA Pro Tip: Do The Kepler Shuffle]]> http://www.open-lab.net/blog/parallelforall/?p=2626 2022-08-21T23:37:03Z 2014-02-03T11:00:00Z When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use...]]>

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each others’ registers by using a new instruction called SHFL…

Source

]]>
3
Mark Harris <![CDATA[CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES]]> http://www.open-lab.net/blog/parallelforall/?p=2503 2022-08-21T23:37:02Z 2014-01-28T01:47:24Z As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason writes:...]]>

As a CUDA developer, you will often need to control which devices your application uses. Chris Mason covers this in a short-but-sweet post on the Acceleware blog. As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide.

Source

]]>
3
Mark Harris <![CDATA[Unified Memory in CUDA 6]]> http://www.open-lab.net/blog/parallelforall/?p=2221 2022-08-21T23:36:58Z 2013-11-18T15:59:27Z With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or...]]>

With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer has to view things. Data that is shared between the CPU and GPU must be…

Source

]]>
87
Mark Harris <![CDATA[Numba: High-Performance Python with CUDA Acceleration]]> http://www.open-lab.net/blog/parallelforall/?p=2028 2022-08-21T23:36:57Z 2013-09-19T15:32:22Z Looking for more? Check out the hands-on DLI training course: Fundamentals of Accelerated Computing with CUDA Python. [Note, this...]]>

Looking for more? Check out the hands-on DLI training course: Fundamentals of Accelerated Computing with CUDA Python [Note, this post was originally published September 19, 2013. It was updated on September 19, 2017.] Python is a high-productivity dynamic programming language that is widely used in science, engineering, and data analytics applications. There are a number of factors…

Source

]]>
16
Mark Harris <![CDATA[Prototyping Algorithms and Testing CUDA Kernels in MATLAB]]> http://www.parallelforall.com/?p=1701 2022-08-21T23:36:55Z 2013-07-15T15:20:02Z This guest post by Daniel Armyr and Dan Doherty from?MathWorks?describes how you can use MATLAB to support your development of CUDA C and C++ kernels. You...]]>

NVIDIA GPUs are becoming increasingly popular for large-scale computations in image processing, financial modeling, signal processing, and other applications—largely due to their highly parallel architecture and high computational throughput. The CUDA programming model lets programmers exploit the full power of this architecture by providing fine-grained control over how computations are divided…

Source

]]>
3
Mark Harris <![CDATA[Develop on your Notebook with GeForce, Deploy on Tesla]]> http://www.parallelforall.com/?p=1548 2022-08-21T23:36:54Z 2013-06-07T00:05:45Z There's a new post over on the NVIDIA Corporate Blog by my colleague Mark Ebersole about the latest line of laptops powered by new GeForce 700-series GPUs. As...]]>

There’s a new post over on the NVIDIA Corporate Blog by my colleague Mark Ebersole about the latest line of laptops powered by new GeForce 700-series GPUs. As Mark explains, the GeForce 700 series (GT 730M, GT 735M, and GT 740M), powered by the low-power GK208 GPU has the latest compute features of the Tesla K20 (powered by the GK110 GPU), including: The availability of the latest GPU…

Source

]]>
3
Mark Harris <![CDATA[CUDA Pro Tip: Understand Fat Binaries and JIT Caching]]> http://www.parallelforall.com/?p=1531 2022-08-21T23:36:54Z 2013-06-05T00:41:31Z As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of...]]>

As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of GPUs, the NVIDIA compiler tool chain supports compiling for multiple architectures in the same application executable or library. CUDA also relies on the PTX virtual GPU ISA to provide forward compatibility, so that already deployed…

Source

]]>
1
Mark Harris <![CDATA[CUDA Pro Tip: Clean Up After Yourself to Ensure Correct Profiling]]> http://www.parallelforall.com/?p=1506 2022-08-21T23:36:54Z 2013-05-28T21:07:05Z NVIDIA's profiling and tracing tools, including the NVIDIA Visual Profiler, NSight Eclipse and Visual Studio editions, cuda-memcheck, and the nvprof command...]]>

NVIDIA’s profiling and tracing tools, including the NVIDIA Visual Profiler, NSight Eclipse and Visual Studio editions, cuda-memcheck, and the nvprof command line profiler are powerful tools that can give you deep insight into the performance and correctness of your GPU-accelerated applications. These tools gather data while your application is running, and use it to create profiles…

Source

]]>
0
Mark Harris <![CDATA[CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops]]> http://www.parallelforall.com/?p=1443 2025-03-17T16:24:00Z 2013-04-23T06:59:24Z One of the most common tasks in CUDA programming is to parallelize a loop using a kernel. As an example, let's use our old friend SAXPY. Here's the basic...]]>

Source

]]>
18
Mark Harris <![CDATA[Finite Difference Methods in CUDA C++, Part 2]]> http://www.parallelforall.com/?p=1399 2022-08-21T23:36:53Z 2013-04-09T06:47:04Z In the previous CUDA C++ post we dove into 3D finite difference computations in CUDA C/C++, demonstrating how to implement the x derivative part of the...]]>

In the previous CUDA C++ post we dove into 3D finite difference computations in CUDA C/C++, demonstrating how to implement the x derivative part of the computation. In this post, let’s continue by exploring how we can write efficient kernels for the y and z derivatives. As with the previous post, code for the examples in this post is available for download on Github. We can easily modify the…

Source

]]>
3
Mark Harris <![CDATA[Finite Difference Methods in CUDA C/C++, Part 1]]> http://www.parallelforall.com/?p=1230 2022-08-21T23:36:53Z 2013-03-04T16:54:19Z In the previous CUDA C/C++ post we investigated how we can use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude improvement...]]>

In the previous CUDA C/C++ post we investigated how we can use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today’s post is to show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory…

Source

]]>
13
Mark Harris <![CDATA[An Efficient Matrix Transpose in CUDA C/C++]]> http://www.parallelforall.com/?p=1166 2022-08-21T23:36:51Z 2013-02-19T04:49:19Z My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance...]]>

My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose of a…
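The tiled transpose the post optimizes looks roughly like this (a sketch assuming square matrices whose dimensions are multiples of the tile size):

```cuda
const int TILE_DIM = 32;
const int BLOCK_ROWS = 8;

// Tiled transpose: both global reads and global writes are coalesced,
// because the strided (transposing) access happens in shared memory.
// The +1 column of padding avoids shared memory bank conflicts.
__global__ void transposeCoalesced(float *odata, const float *idata)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    // Stage a tile of the input in shared memory (coalesced reads).
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        tile[threadIdx.y + j][threadIdx.x] = idata[(y + j) * width + x];

    __syncthreads();

    // Swap block indices so writes to the output are also coalesced.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[(y + j) * width + x] = tile[threadIdx.x][threadIdx.y + j];
}
```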

Source

]]>
31
Mark Harris <![CDATA[Using Shared Memory in CUDA C/C++]]> http://www.parallelforall.com/?p=964 2022-08-21T23:36:50Z 2013-01-29T07:18:11Z In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride...]]>

In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA…
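A minimal kernel that exhibits the strided-access problem, for illustration (the `stride` parameter is the kind of knob a bandwidth benchmark would sweep):

```cuda
// As stride grows, neighboring threads in a warp touch elements that
// are `stride` floats apart, so each warp's accesses span more memory
// transactions and effective bandwidth drops sharply.
__global__ void strideCopy(float *odata, const float *idata, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[i] = idata[i];
}
```

With `stride == 1` every warp reads one or two contiguous cache lines; large strides make each lane a separate transaction.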

Source

]]>
36
Mark Harris <![CDATA[CUDA Pro Tip: Flush Denormals with Confidence]]> http://www.parallelforall.com/?p=938 2022-08-21T23:36:49Z 2013-01-10T22:02:07Z I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point...]]>

I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point representation, I recommend starting with the Wikipedia entry on floating point, and for more detail about NVIDIA GPU floating point, check out this excellent white paper. The Wikipedia entry on denormal numbers is a good start for this post…
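As a rough illustration of the flush-to-zero behavior the post examines, a kernel can probe it at runtime (a sketch: `1e-39f` is chosen to be below `FLT_MIN`, and the result depends on whether the code was compiled with `-ftz=true`, which `-use_fast_math` implies):

```cuda
// Probe whether denormal (subnormal) inputs are flushed to zero.
// volatile prevents the compiler from constant-folding the multiply.
__global__ void probeFtz(float *result)
{
    volatile float denormal = 1e-39f;  // below FLT_MIN (~1.18e-38f)
    *result = denormal * 1.0f;         // 0.0f when FTZ is in effect
}
```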

Source

]]>
0
Mark Harris <![CDATA[How to Access Global Memory Efficiently in CUDA C/C++ Kernels]]> http://www.parallelforall.com/?p=926 2022-08-21T23:36:49Z 2013-01-08T07:13:44Z In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA C/C++ series we discuss how to...]]>

In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA C/C++ series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…

Source

]]>
7
Mark Harris <![CDATA[How to Overlap Data Transfers in CUDA C/C++]]> http://www.parallelforall.com/?p=883 2022-08-21T23:36:49Z 2012-12-14T02:24:51Z In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device.  In this post, we discuss how to overlap data...]]>

In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let’s learn…
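The overlap pattern the post builds up to looks roughly like this (a sketch: `kernel`, `stream`, and the chunking variables are placeholders, `h_a` must be pinned host memory, and `n` is assumed to divide evenly by `nStreams`):

```cuda
// Break the work into chunks and issue each chunk's H2D copy, kernel,
// and D2H copy into its own stream; copies in one stream can then
// overlap with kernels (and, on capable hardware, copies) in others.
for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    cudaMemcpyAsync(&d_a[offset], &h_a[offset], streamBytes,
                    cudaMemcpyHostToDevice, stream[i]);
    kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
    cudaMemcpyAsync(&h_a[offset], &d_a[offset], streamBytes,
                    cudaMemcpyDeviceToHost, stream[i]);
}
```

Note that `cudaMemcpyAsync` only overlaps when the host buffer is pinned; with pageable memory it falls back to a synchronous copy.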

Source

]]>
23
Mark Harris <![CDATA[How to Optimize Data Transfers in CUDA C/C++]]> http://www.parallelforall.com/?p=805 2022-08-21T23:36:49Z 2012-12-05T01:20:31Z In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this...]]>

In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
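The key step is allocating pinned (page-locked) host memory, which the GPU can DMA from directly, rather than pageable memory (a sketch; `bytes` and `d_a` are assumed to be set up elsewhere):

```cuda
// Pinned host allocations transfer at much higher bandwidth than
// pageable memory, at the cost of reducing memory available to the OS.
float *h_pinned;
cudaError_t status = cudaMallocHost((void **)&h_pinned, bytes);
if (status != cudaSuccess)
    printf("Error allocating pinned host memory\n");

// ... fill h_pinned ...
cudaMemcpy(d_a, h_pinned, bytes, cudaMemcpyHostToDevice);

cudaFreeHost(h_pinned);  // pinned memory has its own free function
```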

Source

]]>
12
Mark Harris <![CDATA[How to Query Device Properties and Handle Errors in CUDA C/C++]]> http://test.markmark.net/?p=459 2022-08-21T23:36:48Z 2012-11-22T07:01:04Z In this third post of the CUDA C/C++ series, we discuss various characteristics of the wide range of CUDA-capable GPUs, how to query device properties from...]]>

In this third post of the CUDA C/C++ series, we discuss various characteristics of the wide range of CUDA-capable GPUs, how to query device properties from within a CUDA C/C++ program, and how to handle errors. In our last post, about performance metrics, we discussed how to compute the theoretical peak bandwidth of a GPU. This calculation used the GPU’s memory clock rate and bus interface…
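The two techniques the post covers can be sketched together: enumerating devices to compute theoretical peak bandwidth from the queried clock rate and bus width, and the standard pattern for checking CUDA API errors:

```cuda
// memoryClockRate is in kHz and memoryBusWidth in bits; the factor of
// 2 accounts for double data rate, and /1e6 converts to GB/s.
int deviceCount;
cudaGetDeviceCount(&deviceCount);
for (int i = 0; i < deviceCount; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s\n", i, prop.name);
    printf("  Peak memory bandwidth (GB/s): %f\n",
           2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
}

// Most runtime calls return an error code; kernel launches do not, so
// check cudaGetLastError() after launching.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));
```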

Source

]]>
3
Mark Harris <![CDATA[How to Implement Performance Metrics in CUDA C/C++]]> http://test.markmark.net/?p=390 2023-05-22T22:52:22Z 2012-11-08T04:03:28Z In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss...]]>

In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important. CUDA performance measurement is…
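The core measurement tool is CUDA events, which timestamp points in a stream on the GPU itself, avoiding host-side timing pitfalls (a sketch; the kernel and its arguments are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);  // work to time
cudaEventRecord(stop);

// Block the host until the stop event completes, then read the delta.
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```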

Source

]]>
20
Mark Harris <![CDATA[An Easy Introduction to CUDA C and C++]]> http://test.markmark.net/?p=316 2023-05-22T22:49:47Z 2012-10-31T08:20:21Z Update (January 2017): Check out a new, even easier introduction to CUDA! This post is the first in a series on CUDA C and C++, which is the C/C++ interface to...]]>

This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on unless I…
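The canonical first example of that interface is a SAXPY kernel with a one-thread-per-element mapping; the host-side calls in the comments sketch the usual allocate/copy/launch/copy-back flow:

```cuda
// The "Hello World" of CUDA C: thread i computes y[i] = a*x[i] + y[i].
// The bounds check handles n not being a multiple of the block size.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Host side (sketch):
// cudaMalloc(&d_x, N * sizeof(float));
// cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
// saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);
// cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
```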

Source

]]>
48
Mark Harris <![CDATA[Six Ways to SAXPY]]> http://www.parallelforall.com/?p=40 2023-02-13T18:13:03Z 2012-07-02T11:03:25Z For even more ways to SAXPY using the latest NVIDIA HPC SDK with standard language parallelism, see N Ways to SAXPY: Demonstrating the Breadth of GPU...]]>

Source

]]>
17
Mark Harris <![CDATA[Expressive Algorithmic Programming with Thrust]]> http://www.parallelforall.com/?p=29 2022-10-10T18:41:21Z 2012-06-06T00:06:11Z Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust's High-Level interface greatly enhances...]]>

Source

]]>
2
Mark Harris <![CDATA[An OpenACC Example (Part 2)]]> http://www.parallelforall.com/?p=21 2023-05-18T22:12:51Z 2012-03-26T06:39:14Z You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In my previous post I added 3 lines of OpenACC directives to a...]]>

You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I’ll continue where I left off and demonstrate how we can use OpenACC directive clauses to take more explicit control over how the compiler parallelizes our…
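The kind of directive-annotated loop nest these posts work with looks roughly like this (a sketch of one Jacobi relaxation sweep; the exact directives and clauses used in the posts may differ):

```c
/* One line of OpenACC asks the compiler to offload the loop nest;
   the reduction clause makes the convergence check parallel-safe. */
#pragma acc parallel loop reduction(max:error)
for (int j = 1; j < n - 1; j++) {
    for (int i = 1; i < m - 1; i++) {
        Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
    }
}
```

Taking "more explicit control" then means adding clauses such as `gang`, `vector`, or explicit data regions rather than relying on the compiler's defaults.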

Source

]]>
2
Mark Harris <![CDATA[An OpenACC Example (Part 1)]]> http://www.parallelforall.com/?p=19 2023-05-18T22:12:40Z 2012-03-20T06:37:33Z You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In this post I'll continue where I left off in my introductory...]]>

You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In this post I’ll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I’ll add just a few more lines to push…

Source

]]>
0
Mark Harris <![CDATA[OpenACC: Directives for GPUs]]> http://www.parallelforall.com/?p=12 2022-08-21T23:36:44Z 2012-03-13T05:56:45Z NVIDIA has made a lot of progress with CUDA over the past five years; we estimate that there are over 150,000 CUDA developers, and important science is being accomplished with the help of CUDA. But we have a long way to go to help everyone benefit from GPU computing. There are many programmers who can’t afford the time to learn and apply a parallel programming language. Others…

Source

]]>
0
Mark Harris <![CDATA[Accelerated Solution of Sparse Linear Systems]]> http://www.parallelforall.com/?p=1031 2023-06-12T21:17:08Z 2011-06-23T04:09:18Z Fresh from the NVIDIA Numeric Libraries Team, a white paper illustrating the use of the CUSPARSE and CUBLAS libraries to achieve a 2x speedup of incomplete-LU-...]]>

Fresh from the NVIDIA Numeric Libraries Team, a white paper illustrating the use of the CUSPARSE and CUBLAS libraries to achieve a 2x speedup of incomplete-LU- and Cholesky-preconditioned iterative methods. The paper focuses on the Bi-Conjugate Gradient and stabilized Conjugate Gradient iterative methods that can be used to solve large sparse non-symmetric and symmetric positive definite linear…

Source

]]>
1
Mark Harris <![CDATA[Everything You Ever Wanted to Know About Floating Point but Were Afraid to Ask]]> http://www.parallelforall.com/?p=1028 2024-05-04T00:19:42Z 2011-06-07T21:04:37Z This post was written by Nathan Whitehead A few days ago, a friend came to me with a question about floating point. Let me start by saying that my friend knows...]]>

This post was written by Nathan Whitehead. A few days ago, a friend came to me with a question about floating point. Let me start by saying that my friend knows his stuff; he doesn’t ask stupid questions. So he had my attention. He was working on some biosciences simulation code and was getting answers of a different precision than he expected on the GPU and wanted to know what was up.

Source

]]>
2