“Truth is much too complicated to allow anything but approximations.” — John von Neumann

The history of computing has demonstrated that there is no limit to what can be achieved with the relatively simple arithmetic implemented in computer hardware. But the “truth” that computers represent using finite-size numbers is fundamentally approximate. As David Goldberg wrote…
When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high frequency, because its APIs generally create new Series and DataFrames rather than modifying them in place. The overhead of cudaMalloc and the synchronization of cudaFree was holding RAPIDS back. My first task for RAPIDS was to help with this problem, so I created a rough…
CUDA applications often need to know the maximum available shared memory per block or to query the number of multiprocessors in the active GPU. One way to do this is by calling cudaGetDeviceProperties. Unfortunately, calling this function inside a performance-critical section of your code can lead to huge slowdowns, depending on your code. We found out the hard way when cudaGetDeviceProperties caused a 20x slowdown in the Random Forests algorithm…
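As a hedged illustration of the usual remedy: query only the attributes you need with cudaDeviceGetAttribute, which avoids the cost of filling an entire cudaDeviceProp struct. The attribute enums below are real CUDA runtime names, but this sketch is not the post's exact code, and error checking is omitted for brevity:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // Query only what we need; this is far cheaper than
    // cudaGetDeviceProperties(), which fills every field.
    int smCount = 0, sharedPerBlock = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaDeviceGetAttribute(&sharedPerBlock,
                           cudaDevAttrMaxSharedMemoryPerBlock, device);

    printf("SMs: %d, max shared memory per block: %d bytes\n",
           smCount, sharedPerBlock);
    return 0;
}
```

Better still, query once at startup and cache the results, rather than calling any query API inside a hot loop.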
In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. The granularity of sharing varies from algorithm to algorithm, so thread synchronization should be flexible. Making synchronization an explicit part of the program ensures safety, maintainability, and modularity. CUDA 9 introduces Cooperative Groups…
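A minimal sketch of the Cooperative Groups style this describes: partition a thread block into 32-thread tiles and synchronize at exactly the granularity you need. The kernel and the reduction are illustrative, not code from the post:

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tile_reduce(const int *in, int *out) {
    cg::thread_block block = cg::this_thread_block();
    // An explicit, named group: a 32-thread tile of the block.
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int val = in[block.group_index().x * block.size() + block.thread_rank()];
    // Reduce within the tile; tile.shfl_down() is scoped to these 32 threads.
    for (int offset = tile.size() / 2; offset > 0; offset /= 2)
        val += tile.shfl_down(val, offset);

    if (tile.thread_rank() == 0)
        atomicAdd(out, val);  // one partial sum per tile

    block.sync();  // whole-block synchronization, explicit in the source
}
```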
At the 2017 GPU Technology Conference NVIDIA announced CUDA 9, the latest version of CUDA’s powerful parallel computing platform and programming model. The CUDA Toolkit version 9.0 is now available as a free download. In this post I’ll provide an overview of the awesome new features of CUDA 9. To learn more you can watch the recording of my talk from GTC…
One year ago today, NVIDIA announced the NVIDIA® DGX-1, an integrated system for deep learning. DGX-1 (shown in Figure 1) features eight Tesla P100 GPU accelerators connected through NVLink, the NVIDIA high-performance GPU interconnect, in a hybrid cube-mesh network. Together with dual socket Intel Xeon CPUs and four 100 Gb InfiniBand network interface cards, DGX-1 provides unprecedented…
Update, March 25, 2019: The latest Volta and Turing GPUs now incorporate Tensor Cores, which accelerate certain types of FP16 matrix math. This enables faster and easier mixed-precision computation within popular AI frameworks. Making use of Tensor Cores requires using CUDA 9 or later. NVIDIA has also added automatic mixed precision capabilities to TensorFlow, PyTorch, and MXNet.
Artificial intelligence is already more ubiquitous than many people realize. Applications of AI abound, many of them powered by complex deep neural networks trained on massive data using GPUs. These applications understand when you talk to them; they can answer questions; and they can help you find information in ways you couldn’t before. Pinterest image search technology allows users to find…
Today OpenAI, a non-profit artificial intelligence research company, launched OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go. OpenAI researcher John Schulman shared some details about his organization, and how OpenAI Gym will make it easier for AI researchers to design…
At the 2016 GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang announced the new NVIDIA Tesla P100, the most advanced accelerator ever built. Based on the new NVIDIA Pascal GP100 GPU and powered by ground-breaking technologies, Tesla P100 delivers the highest absolute performance for HPC, technical computing, deep learning, and many computationally intensive datacenter workloads.
Today I’m excited to announce the general availability of CUDA 8, the latest update to NVIDIA’s powerful parallel computing platform and programming model. In this post I’ll give a quick overview of the major new features of CUDA 8. To learn more you can watch the recording of my talk from GTC 2016, “CUDA 8 and Beyond”. A crucial goal for CUDA 8 is to provide support for the powerful new…
The internet has changed how people consume media. Rather than just watching television and movies, the combination of ubiquitous mobile devices, massive computation, and available internet bandwidth has led to an explosion in user-created content: users are re-creating the internet, producing exabytes of content every day. Periscope, a mobile application that lets users broadcast video…
OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs…
The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler: support for GPU Lambda expressions. Lambda expressions, introduced in C++11…
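A sketch of what GPU lambdas enable, assuming the experimental flag of that era (nvcc --expt-extended-lambda); the generic apply kernel is a hypothetical helper, not the post's code:

```
#include <cstdio>

// Generic kernel: applies any device-callable f to each element.
template <typename F>
__global__ void apply(float *x, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = f(x[i]);
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = float(i);

    float scale = 2.0f;  // captured by value into the lambda
    apply<<<(n + 255) / 256, 256>>>(x, n,
        [=] __device__ (float v) { return scale * v; });
    cudaDeviceSynchronize();

    printf("x[3] = %f\n", x[3]);  // expect 6.0
    cudaFree(x);
    return 0;
}
```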
Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details. CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic…
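A small sketch of the FP16-storage pattern mentioned here: keep data in half precision to halve memory traffic, converting to FP32 for the arithmetic. The kernel is illustrative; the conversion intrinsics are the standard ones from cuda_fp16.h:

```
#include <cuda_fp16.h>

// x is stored in FP16; arithmetic happens in FP32.
__global__ void scale_fp16(__half *x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __half2float(x[i]);    // FP16 -> FP32
        x[i] = __float2half(alpha * v);  // compute in FP32, store FP16
    }
}
```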
This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of location-aware and geospatial applications and geographical information systems (GIS), these distance computations have become commonplace. Wikipedia defines a great circle as the intersection of a sphere and a plane that passes through the center point of the sphere. For almost any pair of points on the surface of a sphere…
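To make the connection concrete, here is a hedged sketch of a haversine great-circle distance using sinpi() and cospi() with coordinates in degrees. The haversine formula and the mean earth radius are standard, but this is not necessarily the post's exact code:

```
// sinpi(x) computes sin(pi * x) accurately, so sin(d * pi / 180)
// for d in degrees becomes sinpi(d / 180.0) with no explicit pi.
__device__ double haversine_km(double lat1, double lon1,
                               double lat2, double lon2) {
    const double R = 6371.0;  // mean earth radius in km
    double sdlat = sinpi((lat2 - lat1) / 360.0);  // sin(dlat / 2)
    double sdlon = sinpi((lon2 - lon1) / 360.0);  // sin(dlon / 2)
    double a = sdlat * sdlat +
               cospi(lat1 / 180.0) * cospi(lat2 / 180.0) * sdlon * sdlon;
    return 2.0 * R * asin(sqrt(fmin(1.0, a)));
}
```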
Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units). You can enable linear interpolation (also known as linear filtering) on texture fetches in…
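For reference, a lerp can be written with two fused multiply-adds, which maps well to GPU hardware and is exact at the endpoints; a minimal device-function sketch:

```
// a + t*(b - a), refactored as t*b + (a - t*a) so each step is one FMA.
__device__ __forceinline__ float lerp(float a, float b, float t) {
    return fmaf(t, b, fmaf(-t, a, a));
}
```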
Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone! CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself…
Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with…
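A runnable sketch of the stream model described here: kernels issued to different streams may overlap, while work within each stream stays ordered. The inc kernel is a placeholder:

```
__global__ void inc(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different streams: these two launches may execute concurrently.
    inc<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    inc<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```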
With the US Department of Energy’s announcement of plans to base two future flagship supercomputers on IBM POWER CPUs, NVIDIA GPUs, NVIDIA NVLink interconnect, and Mellanox high-speed networking, many developers are getting started building GPU-accelerated applications that run on IBM POWER processors. The good news is that porting existing applications to this platform is easy. In fact…
Accelerated systems have become the new standard for high performance computing (HPC) as GPUs continue to raise the bar for both performance and energy efficiency. In 2012, Oak Ridge National Laboratory announced what was to become the world’s fastest supercomputer, Titan, equipped with one NVIDIA® GPU per CPU – over 18 thousand GPU accelerators. Titan established records not only in absolute…
You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s…
Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization of the Maxwell architecture. The GeForce GTX 980 and 970 GPUs introduced today are the most advanced gaming and graphics GPUs ever made. But of course they also make fantastic CUDA development GPUs, with full support for CUDA 6.5…
Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more. Last year we introduced CUDA on Arm, and in March we released the Jetson TK1 developer board…
Unified Memory is a CUDA feature that we’ve talked a lot about on Parallel Forall. CUDA 6 introduced Unified Memory, which dramatically simplifies GPU programming by giving programmers a single pointer to data which is accessible from either the GPU or the CPU. But this enhanced memory model has only been available to CUDA C/C++ programmers, until now. The new PGI Compiler release 14.7…
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it’s important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor…
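One way to act on this heuristic is the runtime's occupancy API (cudaOccupancyMaxPotentialBlockSize, available since CUDA 6.5); a sketch with a hypothetical kernel:

```
__global__ void my_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void launch(float *d_x, int n) {
    int minGridSize = 0, blockSize = 0;
    // Returns a block size that maximizes occupancy for this kernel
    // on the current device (no manual calculation required).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       my_kernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    my_kernel<<<gridSize, blockSize>>>(d_x, n);
}
```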
A Givens rotation [1] represents a rotation in a plane, represented by a matrix of the form

$$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix},$$

where the intersections of the $i$th and $j$th rows and columns contain the values $c = \cos\theta$ and $s = \sin\theta$. Multiplying a vector by a Givens rotation matrix $G(i, j, \theta)$ represents a rotation of the vector in the $(i, j)$ plane by $\theta$ radians. According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in…
Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements for these devices are increasing exponentially, while their size, weight, and power limits continue to decrease.
NVIDIA GPU Boost is a feature available on NVIDIA® GeForce® products and NVIDIA® Tesla® products. It makes use of any power headroom to boost application performance. In the case of Tesla, the NVIDIA GPU Boost feature is customized for compute intensive workloads running on clusters. This application note is useful for anyone who wants to take advantage of the power headroom on the Tesla K40 in a…
[Be sure to check out Maxwell: The Most Advanced CUDA GPU Ever Made, a newer post about the second-generation Maxwell GPU architecture.] The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such…
When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each other’s registers by using a new instruction called SHFL…
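A sketch of the idiom this enables: a warp-level sum reduction entirely in registers, with no shared memory. Note that the Kepler-era __shfl_down described in the post was later superseded by the _sync variants used here:

```
// Each iteration halves the number of live values; after the loop,
// lane 0 holds the sum of all 32 lanes.
__device__ int warp_reduce_sum(int val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```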
As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason writes: As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide.
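A minimal sketch of that pattern: enumerate devices, inspect their properties, and select one at run time (here, the device with the most multiprocessors, an arbitrary illustrative criterion):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    int best = 0, bestSMs = 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s (SM %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
        if (prop.multiProcessorCount > bestSMs) {
            bestSMs = prop.multiProcessorCount;
            best = d;
        }
    }
    cudaSetDevice(best);  // subsequent CUDA calls target this device
    return 0;
}
```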
With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer had to view things. Data that is shared between the CPU and GPU must be…
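The single-pointer model reduces to one allocation call; a runnable sketch (not the post's exact example):

```
#include <cstdio>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // one pointer, CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // initialize on the CPU
    scale<<<(n + 255) / 256, 256>>>(x, n, 3.0f);
    cudaDeviceSynchronize();                   // sync before the CPU reads

    printf("x[0] = %f\n", x[0]);               // expect 3.0
    cudaFree(x);
    return 0;
}
```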
Looking for more? Check out the hands-on DLI training course: Fundamentals of Accelerated Computing with CUDA Python [Note, this post was originally published September 19, 2013. It was updated on September 19, 2017.] Python is a high-productivity dynamic programming language that is widely used in science, engineering, and data analytics applications. There are a number of factors…
NVIDIA GPUs are becoming increasingly popular for large-scale computations in image processing, financial modeling, signal processing, and other applications—largely due to their highly parallel architecture and high computational throughput. The CUDA programming model lets programmers exploit the full power of this architecture by providing fine-grained control over how computations are divided…
There’s a new post over on the NVIDIA Corporate Blog by my colleague Mark Ebersole about the latest line of laptops powered by new GeForce 700-series GPUs. As Mark explains, the GeForce 700 series (GT 730M, GT 735M, and GT 740M), powered by the low-power GK208 GPU, has the latest compute features of the Tesla K20 (powered by the GK110 GPU), including: The availability of the latest GPU…
As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of GPUs, the NVIDIA compiler tool chain supports compiling for multiple architectures in the same application executable or library. CUDA also relies on the PTX virtual GPU ISA to provide forward compatibility, so that already deployed…
NVIDIA’s profiling and tracing tools, including the NVIDIA Visual Profiler, Nsight Eclipse and Visual Studio editions, cuda-memcheck, and the nvprof command line profiler, are powerful tools that can give you deep insight into the performance and correctness of your GPU-accelerated applications. These tools gather data while your application is running, and use it to create profiles…
In the previous CUDA C++ post we dove in to 3D finite difference computations in CUDA C/C++, demonstrating how to implement the x derivative part of the computation. In this post, let’s continue by exploring how we can write efficient kernels for the y and z derivatives. As with the previous post, code for the examples in this post is available for download on Github. We can easily modify the…
In the previous CUDA C/C++ post we investigated how we can use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. The topic of today’s post is to show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory…
My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses. The code we wish to optimize is a transpose of a…
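The core of the optimized transpose fits in a few lines. This simplified sketch assumes a square matrix whose width is a multiple of the 32x32 tile and a 32x32 thread block; the post's own kernel uses a slightly different block shape with a per-thread loop:

```
#define TILE 32

// Read a 32x32 tile coalesced, transpose it through shared memory, and
// write it back coalesced. The +1 pad avoids shared memory bank conflicts.
__global__ void transpose_tiled(float *out, const float *in, int width) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Swap block indices so the output write is also row-contiguous.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```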
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA…
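The effect is easy to demonstrate with a kernel whose access stride is a parameter: as the stride grows, each warp spans more memory transactions and effective bandwidth falls. A sketch in the spirit of the post's benchmark (arrays must hold at least gridDim.x * blockDim.x * stride elements):

```
// Thread i touches element i * stride; stride 1 is fully coalesced.
__global__ void strided_copy(float *out, const float *in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```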
I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point representation, I recommend starting with the Wikipedia entry on floating point, and for more detail about NVIDIA GPU floating point, check out this excellent white paper. The Wikipedia entry on denormal numbers is a good start for this post…
In the previous two posts we looked at how to move data efficiently between the host and device. In this sixth post of our CUDA C/C++ series we discuss how to efficiently access device memory, in particular global memory, from within kernels. There are several kinds of memory on a CUDA device, each with different scope, lifetime, and caching behavior. So far in this series we have used global…
In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let’s learn…
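The canonical pattern chunks the work and pipelines copy, compute, and copy-back across streams. In this sketch, process, stream, h_x (which must be pinned host memory), d_x, chunk, and nStreams are placeholder names, and chunk is assumed to be a multiple of 256:

```
for (int s = 0; s < nStreams; ++s) {
    int offset = s * chunk;
    // H2D copy, kernel, and D2H copy for chunk s all go to stream s,
    // so chunk s+1's copy can overlap chunk s's kernel execution.
    cudaMemcpyAsync(d_x + offset, h_x + offset, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);
    process<<<chunk / 256, 256, 0, stream[s]>>>(d_x + offset, chunk);
    cudaMemcpyAsync(h_x + offset, d_x + offset, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();
```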
In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050…
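Pinned (page-locked) host memory is one of the first transfer optimizations the series covers; a sketch of allocating it with the standard cudaMallocHost/cudaFreeHost calls:

```
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 * 1024 * 1024;
    float *h_pinned = nullptr, *d_x = nullptr;

    // Page-locked host memory transfers at full bus bandwidth and is
    // required for cudaMemcpyAsync to be truly asynchronous.
    cudaMallocHost(&h_pinned, bytes);
    cudaMalloc(&d_x, bytes);

    cudaMemcpy(d_x, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_x);
    cudaFreeHost(h_pinned);  // pinned memory has its own free call
    return 0;
}
```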
In this third post of the CUDA C/C++ series, we discuss various characteristics of the wide range of CUDA-capable GPUs, how to query device properties from within a CUDA C/C++ program, and how to handle errors. In our last post, about performance metrics, we discussed how to compute the theoretical peak bandwidth of a GPU. This calculation used the GPU’s memory clock rate and bus interface…
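A sketch combining both topics: an error-checking macro around runtime calls, plus the theoretical peak bandwidth calculation from the device properties (memory clock in kHz, bus width in bits, times 2 for double data rate). The macro name is mine, not the post's:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; report and abort on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    // kHz * (bits / 8 bytes) * 2 (DDR) / 1e6 => GB/s
    double peakGBs = 2.0 * prop.memoryClockRate *
                     (prop.memoryBusWidth / 8.0) / 1e6;
    printf("%s: %.1f GB/s theoretical peak bandwidth\n",
           prop.name, peakGBs);
    return 0;
}
```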
In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important. CUDA performance measurement is…
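CUDA events are the core timing tool covered here; a sketch, assuming the saxpy kernel and device arrays d_x and d_y of length n from the first post in the series (a SAXPY sketch follows the next excerpt):

```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);  // work to time
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // block until the stop event completes

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```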
This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on unless I…
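For readers who want to see the shape of such a program, here is a compact SAXPY sketch in the style of the series (the post's own version may differ in detail):

```
#include <cstdio>
#include <cstdlib>

// Single-precision A*X Plus Y: y = a*x + y, one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f (expect 4.0)\n", y[0]);

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```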
You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I’ll continue where I left off and demonstrate how we can use OpenACC directive clauses to take more explicit control over how the compiler parallelizes our…
You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin. In this post I’ll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I’ll add just a few more lines to push…
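The kind of change involved is small enough to show inline; this C sketch annotates a Jacobi relaxation loop nest like the one in the post (A, Anew, error, n, and m are assumed from that example, and the exact clause set may differ from the post's code):

```
// One Jacobi sweep; the single directive asks the compiler to offload
// the loop nest to the accelerator and manage the data movement.
#pragma acc kernels
for (int j = 1; j < n - 1; j++) {
    for (int i = 1; i < m - 1; i++) {
        Anew[j][i] = 0.25f * (A[j][i + 1] + A[j][i - 1] +
                              A[j - 1][i] + A[j + 1][i]);
        error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
    }
}
```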
Fresh from the NVIDIA Numeric Libraries Team, a white paper illustrating the use of the CUSPARSE and CUBLAS libraries to achieve a 2x speedup of incomplete-LU- and Cholesky-preconditioned iterative methods. The paper focuses on the Bi-Conjugate Gradient and stabilized Conjugate Gradient iterative methods that can be used to solve large sparse non-symmetric and symmetric positive definite linear…
This post was written by Nathan Whitehead. A few days ago, a friend came to me with a question about floating point. Let me start by saying that my friend knows his stuff; he doesn’t ask stupid questions. So he had my attention. He was working on some biosciences simulation code and was getting answers of a different precision than he expected on the GPU, and wanted to know what was up.