NVIDIA DGX Cloud Serverless Inference is an auto-scaling AI inference solution that enables application deployment with speed and reliability. Powered by NVIDIA Cloud Functions (NVCF), DGX Cloud Serverless Inference abstracts multi-cluster infrastructure setups across multi-cloud and on-premises environments for GPU-accelerated workloads. Whether managing AI workloads…
As AI capabilities advance, understanding the impact of hardware and software infrastructure choices on workload performance is crucial for both technical validation and business planning. Organizations need a better way to assess real-world, end-to-end AI workload performance and the total cost of ownership rather than just comparing raw FLOPs or hourly cost per GPU.
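To make the FLOPs-versus-TCO point concrete, here is a small worked example in C++; the GPU names, hourly prices, and token throughputs are invented purely for illustration, not measurements:

    #include <cstdio>

    // Hypothetical numbers: the pricier GPU still wins on cost per unit of
    // work, which is why end-to-end throughput matters more than $/GPU-hour.
    int main() {
        struct Offer { const char* name; double dollarsPerHour; double tokensPerSecond; };
        const Offer gpus[] = {
            {"GPU A", 2.0, 1000.0},   // cheaper per hour (assumed)
            {"GPU B", 4.0, 3500.0},   // pricier per hour, much faster (assumed)
        };
        for (const Offer& g : gpus) {
            double costPerMillionTokens =
                g.dollarsPerHour / (g.tokensPerSecond * 3600.0) * 1e6;
            printf("%s: $%.2f/hr -> $%.3f per million tokens\n",
                   g.name, g.dollarsPerHour, costPerMillionTokens);
        }
        return 0;
    }

Under these assumed numbers, GPU B costs twice as much per hour yet delivers work at roughly 43% lower cost per million tokens.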
Organizations are embracing AI agents to enhance productivity and streamline operations. To maximize their impact, these agents need strong reasoning abilities to navigate complex problems, uncover hidden connections, and make logical decisions autonomously in dynamic environments. Due to their ability to tackle complex problems, reasoning models have become a key part of the agentic AI…
Large language models (LLMs) have shown remarkable generalization capabilities in natural language processing (NLP). They are used in a wide range of applications, including translation, digital assistants, recommendation systems, context analysis, code generation, cybersecurity, and more. In automotive applications, there is growing demand for LLM-based solutions for both autonomous driving and…
Training AI models on massive GPU clusters presents significant challenges for model builders. Because manual intervention becomes impractical as job scale increases, automation is critical to maintaining high GPU utilization and training productivity. An exceptional training experience requires resilient systems that provide low-latency error attribution and automatic failover based on root…
According to the World Health Organization (WHO), 3.6 billion medical imaging tests are performed every year globally to diagnose, monitor, and treat various conditions. Most of these images are stored in a globally recognized standard called DICOM (Digital Imaging and Communications in Medicine). Imaging studies in DICOM format are a combination of unstructured images and structured metadata.
NVIDIA Enterprise Reference Architectures (Enterprise RAs) can reduce the time and cost of deploying AI infrastructure solutions. They provide a streamlined approach for building flexible and cost-effective accelerated infrastructure while ensuring compatibility and interoperability. The latest Enterprise RA details an optimized cluster configuration for systems integrated with NVIDIA GH200…
In data science, operational efficiency is key to handling increasingly complex and large datasets. GPU acceleration has become essential for modern workflows, offering significant performance improvements. RAPIDS is a suite of open-source libraries and frameworks developed by NVIDIA, designed to accelerate data science pipelines using GPUs with minimal code changes.
Hardware support for ray tracing triangle meshes was introduced as part of NVIDIA RTX in 2018. But ray tracing for hair and fur has remained a compute-intensive problem that has been difficult to further accelerate. That is, until now. NVIDIA GeForce 50 Series GPUs include a major advancement in the acceleration of ray tracing for hair and fur: hardware ray tracing support for the linear…
Geometric detail in computer graphics has increased exponentially in the past 30 years. To render high quality assets with higher instance counts and greater triangle density, NVIDIA introduced RTX Mega Geometry. RTX Mega Geometry is available today through NVIDIA RTX Kit, a suite of rendering technologies to ray trace games with AI, render scenes with immense geometry, and create game characters…
NVIDIA AI Workbench is a free development environment manager to develop, customize, and prototype AI applications on your GPUs. AI Workbench provides a frictionless experience across PCs, workstations, servers, and cloud for AI, data science, and machine learning (ML) projects. The user experience includes: … This post provides details about the January 2025 release of NVIDIA AI Workbench…
The next generation of NVIDIA graphics hardware has arrived. Powered by NVIDIA Blackwell, GeForce RTX 50 Series GPUs deliver groundbreaking new RTX features such as DLSS 4 with Multi Frame Generation, and NVIDIA RTX Kit with RTX Mega Geometry and RTX Neural Shaders. NVIDIA RTX Blackwell architecture introduces fifth-generation Tensor Cores to drive AI workloads and fourth-generation RT Cores with…
Evaluating large language models (LLMs) and retrieval-augmented generation (RAG) systems is a complex and nuanced process, reflecting the sophisticated and multifaceted nature of these systems. Unlike traditional machine learning (ML) models, LLMs generate a wide range of diverse and often unpredictable outputs, making standard evaluation metrics insufficient. Key challenges include the…
As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. The explosion of AI-driven applications has placed unprecedented demands on both developers, who must balance delivering cutting-edge performance with managing operational complexity and cost, and AI infrastructure. NVIDIA is empowering developers with full-stack innovations spanning chips, systems…
At NVIDIA, the Sales Operations team equips the Sales team with the tools and resources needed to bring cutting-edge hardware and software to market. Managing this across NVIDIA's diverse technology is a complex challenge shared by many enterprises. Through collaboration with our Sales team, we found that they rely on internal and external documentation…
Language models generate text by predicting the next token, given all the previous tokens including the input text tokens. Key and value elements of the previous tokens are used as historical context in LLM serving for generation of the next set of tokens. Caching these key and value elements from previous tokens avoids expensive recomputation and effectively leads to higher throughput. However…
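As a rough illustration of the caching idea, here is a toy C++ sketch (not code from the post; computeKV stands in for the real per-token key/value computation of a transformer layer):

    #include <cstdio>
    #include <vector>

    struct KV { float k, v; };            // toy per-token key/value pair

    KV computeKV(int token) {             // imagine an expensive layer here
        return { token * 0.5f, token * 2.0f };
    }

    int main() {
        const std::vector<int> tokens = {7, 3, 9, 4};
        std::vector<KV> cache;            // grows by one entry per token

        for (size_t step = 0; step < tokens.size(); ++step) {
            // Without the cache, each step would recompute K/V for ALL
            // previous tokens (O(n^2) total work). With it, each token's
            // K/V is computed exactly once and appended.
            cache.push_back(computeKV(tokens[step]));

            // "Attention" for this step reads the whole history from cache.
            float score = 0.0f;
            for (const KV& kv : cache) score += kv.k * kv.v;
            printf("step %zu attends over %zu cached entries, score=%.1f\n",
                   step, cache.size(), score);
        }
        return 0;
    }

The teaser's closing "However" likely points to the cost: the cache grows linearly with context length per request, so memory becomes the constraint to manage.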
Generative AI has revolutionized how people bring ideas to life, and agentic AI represents the next leap forward in this technological evolution. By leveraging sophisticated, autonomous reasoning and iterative planning, AI agents can tackle complex, multistep problems with remarkable efficiency. As AI continues to revolutionize industries, the demand for running AI models locally has surged.
RAPIDS is a suite of open-source GPU-accelerated data science and AI libraries that are well supported for scale-out with distributed engines like Spark and Dask. Ray is a popular open-source distributed Python framework commonly used to scale AI and machine learning (ML) applications. Ray particularly excels at simplifying and scaling training and inference pipelines and can easily target both…
Agentic AI workflows often involve the execution of large language model (LLM)-generated code to perform tasks like creating data visualizations. However, this code should be sanitized and executed in a safe environment to mitigate risks from prompt injection and errors in the returned code. Sanitizing Python with regular expressions and restricted runtimes is insufficient…
WEKA, a pioneer in scalable software-defined data platforms, and NVIDIA are collaborating to unite WEKA's state-of-the-art data platform solutions with powerful NVIDIA BlueField DPUs. The WEKA Data Platform advanced storage software unlocks the full potential of AI and performance-intensive workloads, while NVIDIA BlueField DPUs revolutionize data access, movement, and security.
One of the great pastimes of graphics developers and enthusiasts is comparing specifications of GPUs and marveling at the ever-increasing counts of shader cores, RT cores, teraflops, and overall computational power with each new generation. Achieving the maximum theoretical performance represented by those numbers is a major focus in the world of graphics programming. Massive amounts of rendering…
As we move toward denser computing infrastructure, with more compute, more GPUs, accelerated networking, and so forth, multi-GPU training and analysis grows in popularity. Developers and practitioners moving from CPU to GPU clusters need both tools and best practices. RAPIDS is a suite of open-source GPU-accelerated data science and AI libraries. These libraries can easily scale out for…
AI models for science are often trained to make predictions about the workings of nature, such as predicting the structure of a biomolecule or the properties of a new solid that can become the next battery material. These tasks require high precision and accuracy. What makes AI for science even more challenging is that highly accurate and precise scientific data is often scarce…
NVIDIA AI Workbench is a free development environment manager that streamlines data science, AI, and machine learning (ML) projects on systems of choice. The goal is to provide a frictionless way to create, compute, and collaborate on and across PCs, workstations, data centers, and clouds. The basic user experience is straightforward: … This post explores highlights of the October release…
NVIDIA technology helps organizations build and maintain secure, scalable, and high-performance network infrastructure. Advances in AI, with NVIDIA at the forefront, contribute to security improvements every day. One way NVIDIA has taken a more direct approach to network security is through a secure network operating system (NOS), a specialized type of…
There's never enough time to do everything, even in engineering education. Employers want engineers capable of wielding simulation tools to expedite iterative research, design, and development. Some instructors try to address this by teaching for weeks or months on derivations of numerical methods, approaches to discretization, the intricacies of turbulence models, and more. Unfortunately…
The NVIDIA RTX AI for Windows PCs platform offers a thriving ecosystem of thousands of open-source models for application developers to leverage and integrate into Windows applications. Notably, llama.cpp is one popular tool, with over 65K GitHub stars at the time of writing. Originally released in 2023, this open-source repository is a lightweight, efficient framework for large language model…
As large language models (LLMs) continue to evolve at an unprecedented pace, enterprises are looking to build generative AI-powered applications that maximize throughput to lower operational costs and minimize latency to deliver superior user experiences. This post discusses the critical performance metrics of throughput and latency for LLMs, exploring their importance and trade-offs between…
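The trade-off is easy to see with a little arithmetic. The sketch below uses invented batch sizes and per-iteration times (illustrative assumptions, not measurements) to show how batching raises aggregate throughput while each user waits longer per token:

    #include <cstdio>

    int main() {
        // Assumed: one token generated per user per iteration, and the
        // iteration slows down somewhat as the batch grows.
        const int    batch[]  = {1, 8, 32};
        const double stepMs[] = {20.0, 26.0, 40.0};   // hypothetical timings
        for (int i = 0; i < 3; ++i) {
            double tokensPerSecond = batch[i] / (stepMs[i] / 1000.0);
            printf("batch %2d: %6.0f tok/s aggregate, %5.1f ms/token per user\n",
                   batch[i], tokensPerSecond, stepMs[i]);
        }
        return 0;
    }

With these numbers, batch 32 delivers 16x the aggregate throughput of batch 1, but each user's time per token doubles; serving systems pick a point along this curve to meet a latency target.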
Shaders are specialized programs, running on the GPU, that manipulate rays, pixels, vertices, and textures to achieve unique visual effects. With shaders, you can add creative expression and realism to the rendered image. They're essential in ray tracing for simulating realistic lighting, shadows, and reflections. We love shaders, but they can be hard to debug. Shader calculations are complex…
Developers from advertising agencies to software vendors are empowering global brands to deliver hyperpersonalization for digital experiences and visual storytelling with product configurator solutions. Integrating NVIDIA Omniverse with OpenUSD and generative AI into product configurators enables solution providers and software developers to deliver interactive, ray-traced…
General-purpose large language models (LLMs) have proven their usefulness across various fields, offering substantial benefits in applications ranging from text generation to complex problem-solving. However, there are circumstances where developing a bespoke language model becomes not just beneficial but essential. This necessity arises particularly in specialized domains characterized by…
This post is part of the NVIDIA AI Red Team's continuing vulnerability and technique research. Use the concepts presented to responsibly assess and increase the security of your AI development and deployment processes and applications. Large language models (LLMs) don't operate over strings. Instead, prompts are passed through an often-transparent translator called a tokenizer that creates an…
The latest release of NVIDIA cuBLAS library, version 12.5, continues to deliver functionality and performance to deep learning (DL) and high-performance computing (HPC) workloads. This post provides an overview of the following updates on cuBLAS matrix multiplications (matmuls) since version 12.0, and a walkthrough: Grouped GEMM APIs can be viewed as a generalization of the batched…
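For context, here is a minimal sketch of the strided batched API that grouped GEMM generalizes: the single call below launches eight identically shaped multiplications, while the grouped APIs additionally allow each group to have its own shapes and parameters. Sizes are arbitrary, data initialization is omitted, and you would compile with -lcublas:

    #include <cstdio>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main() {
        const int m = 4, n = 4, k = 4, batch = 8;     // 8 independent 4x4 GEMMs
        const long long strideA = m * k, strideB = k * n, strideC = m * n;
        float *A, *B, *C;
        cudaMalloc(&A, sizeof(float) * strideA * batch);
        cudaMalloc(&B, sizeof(float) * strideB * batch);
        cudaMalloc(&C, sizeof(float) * strideC * batch);
        // (Fill A and B with real data in practice.)

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        // C[i] = alpha * A[i] * B[i] + beta * C[i], for i in [0, batch)
        cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                                  &alpha, A, m, strideA, B, k, strideB,
                                  &beta, C, m, strideC, batch);
        cudaDeviceSynchronize();
        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        printf("batched GEMM submitted\n");
        return 0;
    }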
Ready to move your pilot to production? Get an expert overview on how to deploy generative AI applications.
Retrieval-augmented generation (RAG) is a technique that combines information retrieval with a set of carefully designed system prompts to provide more accurate, up-to-date, and contextually relevant responses from large language models (LLMs). By incorporating data from various sources such as relational databases, unstructured document repositories, internet data streams, and media news feeds…
At GTC 2024, experts from NVIDIA and our partners shared insights about GPU-accelerated tools, optimizations, and best practices for data scientists. From the hundreds of sessions covering various topics, we've handpicked the top three data science sessions that you won't want to miss. RAPIDS in 2024: Accelerated Data Science Everywhere Speakers: Dante Gama Dessavre…
Many PC games are designed around an eight-core console with an assumption that their software threading system "just works" on all PCs, especially regarding the number of threads in the worker thread pool. This was a reasonable assumption not too long ago when most PCs had similar core counts to consoles: the CPUs were just faster and performance just scaled. In recent years though…
A common technological misconception is that performance and complexity are directly linked. That is, the highest-performance implementation is also the most challenging to implement and manage. When considering data center networking, however, this is not the case. InfiniBand is a protocol that sounds daunting and exotic in comparison to Ethernet, but because it is built from the ground up…
Many CUDA applications running on multi-GPU platforms usually use a single GPU for their compute needs. In such scenarios, applications pay a performance penalty because CUDA has to enumerate and initialize all the GPUs on the system. If a CUDA application does not require other GPUs to be visible and accessible, you can launch such applications by isolating the unwanted GPUs from the CUDA…
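The standard mechanism for this isolation is the CUDA_VISIBLE_DEVICES environment variable, which limits which GPUs the CUDA runtime enumerates. A minimal way to observe the effect, using only standard CUDA runtime API calls:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Run as:  CUDA_VISIBLE_DEVICES=0 ./app
        // and CUDA enumerates (and initializes) only GPU 0; the other GPUs
        // are invisible to the process, avoiding their startup cost.
        int count = 0;
        cudaGetDeviceCount(&count);
        printf("CUDA sees %d device(s)\n", count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("  device %d: %s\n", i, prop.name);
        }
        return 0;
    }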
Swap chains are an integral part of how you get rendering data output to a screen. They usually consist of a group of output-ready buffers, each of which can be rendered to one at a time in rotation. In parallel with rendering to one of a swap chain's buffers, some other buffer in the swap chain is generally read from for display output. This post covers best practices when working with…
Intrinsics can be thought of as higher-level abstractions of specific hardware instructions. They offer direct access to low-level operations or hardware-specific features, enabling increased performance. In this way, operations can be performed across threads within a warp, also known as a wavefront. The following code example shows…
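The post's own listing is truncated above, so as a stand-in, here is a minimal CUDA sketch of one widely used warp intrinsic, __shfl_down_sync, which lets the 32 threads of a warp sum their values without shared memory:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void warpSum(const float* in, float* out) {
        float v = in[threadIdx.x];
        // Tree reduction across the warp: halve the stride each step.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (threadIdx.x == 0) *out = v;   // lane 0 ends up with the total
    }

    int main() {
        float h[32], result, *dIn, *dOut;
        for (int i = 0; i < 32; ++i) h[i] = 1.0f;    // expected sum: 32
        cudaMalloc(&dIn, sizeof(h));
        cudaMalloc(&dOut, sizeof(float));
        cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);
        warpSum<<<1, 32>>>(dIn, dOut);
        cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
        printf("warp sum = %.0f\n", result);
        cudaFree(dIn); cudaFree(dOut);
        return 0;
    }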
Large language models (LLMs) provide a wide range of powerful enhancements to nearly any application that processes text. And yet they also introduce new risks, including: … This post walks through these security vulnerabilities in detail and outlines best practices for designing or evaluating a secure LLM-enabled application. Prompt injection is the most common and well-known…
Diamond Light Source is a world-renowned synchrotron facility in the UK that provides scientists with access to intense beams of x-rays, infrared, and other forms of light to study materials and biological structures. The facility boasts over 30 experimental stations, or beamlines, and is home to some of the most advanced and complex scientific research projects in the world. I08-1…
By using descriptor types, you can bind resources to shaders and specify how those resources are accessed. This creates efficient communication between the CPU and GPU and enables shaders to access the necessary data during rendering.
Data is the lifeblood of AI systems, which rely on robust datasets to learn and make predictions or decisions. For perception AI models specifically, it is essential that data reflects real-world environments and incorporates a wide array of scenarios. This includes edge use cases for which data is often difficult to collect, such as street traffic and manufacturing assembly lines.
Traditional cloud data centers have served as the bedrock of computing infrastructure for over a decade, catering to a diverse range of users and applications. However, data centers have evolved in recent years to keep up with advancements in technology and the surging demand for AI-driven computing. This post explores the pivotal role that networking plays in shaping the future of data centers…
The NVIDIA AI Red Team is focused on scaling secure development practices across the data, science, and AI ecosystems. We participate in open-source security initiatives, release tools, present at industry conferences, host educational competitions, and provide innovative training. Covering 3 years and totaling almost 140GB of source code, the recently released Meta Kaggle for Code dataset is…
In today's data center, there are many ways to achieve system redundancy from a server connected to a fabric. Customers usually seek redundancy to increase service availability (such as achieving end-to-end AI workloads) and improve system efficiency using different multihoming techniques. In this post, we discuss the pros and cons of the well-known proprietary multi-chassis link aggregation…
NVIDIA has already made available a GPU driver binary symbols server for Windows. Now, NVIDIA is making available a repository of CUDA Toolkit symbols for Linux. NVIDIA is introducing CUDA Toolkit symbols for Linux to enhance application development. During application development, you can now download obfuscated symbols for NVIDIA libraries that are being debugged or profiled in…
This post covers best practices when working with shaders on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Shaders play a critical role in graphics programming by enabling you to control various aspects of the rendering process. They run on the GPU and are responsible for manipulating vertices, pixels, and other data.
Picture this: You're browsing through an online store, looking for the perfect pair of running shoes. But with thousands of options available, where do you even begin? Suddenly, a section catches your eye: "Recommended for You." Intrigued, you click and, within seconds, a curated list of running shoes tailored to your unique preferences appears. It's as if the website understands your tastes…
This post covers best practices when working with pipeline state objects on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Pipeline state objects (PSOs) define how input data is interpreted and rendered by the hardware when submitting work to the GPUs. Proper management of PSOs is essential for optimal usage of system…
We were stuck. Really stuck. With a hard delivery deadline looming, our team needed to figure out how to process a complex extract-transform-load (ETL) job on trillions of point-of-sale transaction records in a few hours. The results of this job would feed a series of downstream machine learning (ML) models that would make critical retail assortment allocation decisions for a global retailer.
If you are looking to take your machine learning (ML) projects to new levels of speed and scalability, GPU-accelerated data analytics can help you deliver insights quickly with breakthrough performance. From faster computation to efficient model training, GPUs bring many benefits to everyday ML tasks. Update: The blog below describes how to use GPU-only RAPIDS cuDF…
If you are a DirectX 12 (DX12) game developer, you may have noticed that GPU times displayed in real time in your game HUD may change over time for a given pass. This may be the case even if nothing has changed on the application side. One reason for GPU time variations may be GPU Boost dynamically changing the GPU core clock frequency. Still, even with GPU Boost disabled using the DX12…
This post covers CPU best practices when working with NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. To get the best performance from your NVIDIA GPU, pair it with efficient work delegation on the CPU. Frame-rate caps, stutter, and other subpar application performance events can often be traced back to a bottleneck on the CPU.
This post covers best practices for using sampler feedback on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Sampler feedback is a DirectX 12 Ultimate feature for capturing and recording texture sampling information and locations. Sampler feedback was designed to provide better support for streaming and texture-space shading.
There are many benefits of GPUs in scaling AI, ranging from faster model training to GPU-accelerated fraud detection. While planning AI models and deployed apps, scalability challenges, especially performance and storage, must be accounted for. Regardless of the use case, AI solutions have four elements in common: … Of these elements, data storage is often the most neglected during…
Software teams comprise a broad range of professionals, from software engineers and data scientists to project managers and technical writers. Sharing code with other team members is common when working on a project, and it is important to track all changes. This is where pull requests come in. In software development, a pull request is used to push local changes into a shared repository…
Storytelling with data is a crucial soft skill for AI and data professionals. To ensure that stakeholders understand the technical requirements, value, and impact of data science team efforts, it is necessary for data scientists, data engineers, and machine learning (ML) engineers to communicate effectively. This post provides a framework and tips you can adopt to incorporate key elements of…
This post is an update of Best Practices: Using NVIDIA RTX Ray Tracing. This post gathers best practices based on our experiences so far using NVIDIA RTX ray tracing in games. The practical tips are organized into short, actionable items for developers working on ray tracing today. They aim to provide insight into what kind of solutions lead to good performance in most cases.
This post covers best practices for Vulkan clearing and presenting on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. With the recent Vulkan 1.3 release, it's timely to add some Vulkan-specific tips that are not necessarily explicitly covered by the other Advanced API Performance posts. In addition to introducing new Vulkan 1.3…
This post covers best practices for using SetStablePowerState on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Most modern processors, including GPUs, change processor core and memory clock rates during application execution. These changes can cause performance to vary, introducing errors in measurements and rendering comparisons…
This post covers best practices for variable rate shading on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Variable rate shading (VRS) is a graphics feature allowing applications to control the frequency of pixel shader invocations independent of the resolution of the render target. It is available in both D3D12 and Vulkan.
This post covers best practices for clears on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Surface clearing is a widely used accessory operation. Thanks to Michael Murphy, Maurice Harris, Dmitry Zhdan, and Patric Neil for their advice and feedback.
This post covers best practices for mesh shaders on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Mesh shaders are a recent addition to the programmable pipeline and aim to overcome the bottlenecks of the fixed layout used by the classical geometry pipeline. This post covers best practices for both DirectX and Vulkan…
This post covers best practices for memory and resources on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Optimal memory management in DirectX 12 is critical to a performant application. The following advice should be followed for the best performance while avoiding stuttering.
This post covers best practices for command buffers on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Command buffers are the main mechanism for sending commands from the CPU to be executed on the GPU. By following the best practices listed in this post, you can achieve performance gains on both the CPU and the GPU by…
This post covers best practices for barriers on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. For the best performance on our hardware, here's what you should and shouldn't do when you're using barriers with DX12 or Vulkan. This is updated from DX12 Do's And Don'ts.
This post covers best practices for async copy on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. Async copy runs on completely independent hardware, but you have to schedule it onto a separate queue. You can consider turning an async copy into an async compute as a performance strategy. NVIDIA has a dedicated async copy…
This post covers best practices for async compute and overlap on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips. The general principle behind async compute is to increase the overall unit throughput by reducing the number of unused warp slots and to facilitate the simultaneous use of nonconflicting datapaths.
The growth of edge computing has been a hot topic in many industries. The value of smart infrastructure can mean improvements to overall operational efficiency, safety, and even the bottom line. However, not all workloads need to be, or even should be, deployed at the edge. Enterprises use a combination of edge computing and cloud computing when developing and deploying AI applications.
In ray tracing, more geometry can reside in GPU memory than with rasterization because rays may hit geometry outside the view frustum. You can let the GPU compact acceleration structures to reduce memory usage. For some games, compaction reduces the memory footprint for a bottom-level acceleration structure (BLAS) by at least 50%. BLASes usually take more GPU memory than top…
A leading global retailer has invested heavily in becoming one of the most competitive technology companies around. Accurate and timely demand forecasting for millions of item-by-store combinations is critical to serving their millions of weekly customers. Key to their success in forecasting is RAPIDS, an open-source suite of GPU-accelerated libraries. RAPIDS helps them tear through their…
DLSS is a deep learning, super-resolution network that boosts frame rates by rendering fewer pixels and then using AI to construct sharp, higher-resolution images. Dedicated computational units on NVIDIA RTX GPUs called Tensor Cores accelerate the AI calculations, allowing the algorithm to run in real time. DLSS pairs perfectly with computationally intensive rendering algorithms such as real-time…
This post has been updated: Best Practices for Using NVIDIA RTX Ray Tracing (Updated). This post gathers best practices based on our experiences so far on using NVIDIA RTX ray tracing in games. I've organized the content into short, actionable items with practical tips for developers working on ray tracing today. They aim to give a broad picture of what kind of solutions lead to good…
Roughly five months ago, we introduced you to the new ray tracing support (via DirectX Raytracing) in the 4.22 release of Unreal Engine. Recently, Epic Games released version 4.23 which brings a number of upgrades for those working with ray tracing. Even better, many of these new and improved features, such as enhancements to performance, quality, and stability, require no direct user effort.
Our most popular question is "What can I do to get great GPU performance for deep learning?" We've recently published a detailed Deep Learning Performance Guide to help answer this question. The guide explains how GPUs process data and gives tips on how to design networks for better performance. We also take a close look at Tensor Core optimization to help improve performance. This post takes a…
Note: This post was updated on 1/14/2025. The increased performance potential of modern graphics APIs is coupled with a dramatically increased level of developer responsibility. Optimal use of Vulkan is not a trivial concept, especially in the context of a large engine, and information about how to maximize performance is still somewhat sparse.
This post presents best practices for implementing ray tracing in games and other real-time graphics applications. We present these as briefly as possible to help you quickly find key ideas. This is based on a presentation made at the 2019 GDC by NVIDIA engineers. 1.1 General Practices Move AS management (build/update) to an async compute queue. Using an async compute queue pairs…
Post updated on December 10, 2024. NVIDIA has deprecated nvprof and NVIDIA Visual Profiler and these tools are not supported on current GPU architectures. The original post still applies to previous GPU architectures, up to and including Volta. For Volta and newer architectures, profile your applications with NVIDIA Nsight Compute and NVIDIA Nsight Systems. For more information about how to…
The last time you used the timeline feature in the NVIDIA Visual Profiler, Nsight VSE, or the new Nsight Systems to analyze a complex application, you might have wished to see a bit more than just CUDA API calls and GPU kernels. In this post I will show you how you can use the NVIDIA Tools Extension (NVTX) to annotate the timeline with useful information. I will demonstrate how to add time…
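The core NVTX pattern is a push/pop pair around each region you want labeled; here is a minimal sketch assuming the NVTX v3 header that ships with the CUDA Toolkit:

    #include <cuda_runtime.h>
    #include <nvtx3/nvToolsExt.h>     // NVTX v3, bundled with the CUDA Toolkit

    __global__ void work() { }        // placeholder kernel

    int main() {
        nvtxRangePushA("initialization");   // open a named range
        cudaFree(0);                        // forces CUDA context creation
        nvtxRangePop();                     // close it

        nvtxRangePushA("compute phase");
        work<<<1, 1>>>();
        cudaDeviceSynchronize();
        nvtxRangePop();
        return 0;
    }

When the application is profiled (for example, with Nsight Systems), each push/pop pair appears as a labeled bar on the timeline alongside the CUDA API calls and kernels.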
]]>