Large language models (LLMs) often struggle with accuracy when handling domain-specific questions, especially those requiring multi-hop reasoning or access to proprietary data. While retrieval-augmented generation (RAG) can help, traditional vector search methods often fall short. In this tutorial, we show you how to implement GraphRAG in combination with fine-tuned GNN+LLM models to achieve…
Researchers at Washington State University (WSU) unveiled a new AI-guided 3D printing technique that can help physicians print intricate replicas of human organs. Surgeons can then use these organ models to practice before performing the actual surgery, which gives doctors more tools to improve surgical results. The AI algorithm was trained on images and key attributes of human kidneys and…
The latest release of the NVIDIA cuBLAS library, version 12.5, continues to deliver new functionality and performance improvements for deep learning (DL) and high-performance computing (HPC) workloads. This post provides an overview of the following updates on cuBLAS matrix multiplications (matmuls) since version 12.0, and a walkthrough: Grouped GEMM APIs can be viewed as a generalization of the batched…
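The grouped GEMM entry points themselves are new in cuBLAS 12.5, so as a point of reference here is a minimal sketch of the fixed-shape batched GEMM that they generalize, using the long-standing cublasSgemmBatched call. The helper name and shapes are illustrative, not from the post; in batched GEMM every problem shares one (m, n, k), whereas the grouped APIs let each group carry its own shape. Error handling is omitted.

```cpp
// Minimal sketch: the classic batched GEMM that grouped GEMM generalizes.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void batched_sgemm(cublasHandle_t handle,
                   const float* const dA[],  // device array of device pointers to A matrices
                   const float* const dB[],  // device array of device pointers to B matrices
                   float* const dC[],        // device array of device pointers to C matrices
                   int m, int n, int k, int batchCount)
{
    const float alpha = 1.0f, beta = 0.0f;
    // C[i] = alpha * A[i] * B[i] + beta * C[i] for every i in the batch,
    // all problems sharing the same m, n, k and leading dimensions.
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       m, n, k,
                       &alpha,
                       dA, m,   // lda
                       dB, k,   // ldb
                       &beta,
                       dC, m,   // ldc
                       batchCount);
}
```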
GPU-driven rendering has long been a major goal for many game applications. It enables better scalability for handling large virtual scenes and reduces cases where the CPU could bottleneck a game's performance. Short of running the game's logic on the GPU, I see the pinnacle of GPU-driven rendering as a scenario in which the CPU sends the GPU only the new frame's camera information…
NVIDIA A100 Tensor Core GPUs were featured in a stack that set several records in a recent run of STAC-A2, a benchmark standard based on financial market risk analysis.
The broadcast industry is undergoing a transformation in how content is created, managed, distributed, and consumed. This transformation includes a shift from traditional linear workflows bound by fixed-function devices to flexible and hybrid, software-defined systems that enable the future of live streaming. Developers can now apply to join the early access program for NVIDIA Holoscan for…
Deep learning is achieving significant success across a wide range of fields, as it has revolutionized the way we analyze, understand, and manipulate data. There are many success stories in computer vision, natural language processing (NLP), medical diagnosis and health care, autonomous vehicles, recommendation systems, and climate and weather modeling. In an era of ever-growing neural network…
The pace of 5G investment and adoption is accelerating. According to the GSMA Mobile Economy 2023 report, nearly $1.4 trillion will be spent on 5G CapEx between 2023 and 2030. The radio access network (RAN) may account for over 60% of that spend. Increasingly, CapEx spending is moving from the traditional approach with proprietary hardware to virtualized RAN (vRAN) and Open RAN architectures…
Lowering response times to new market events is a driving force in algorithmic trading. Latency-sensitive trading firms keep up with the ever-increasing pace of financial electronic markets by deploying low-level hardware devices like Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs) into their systems. However, as markets become increasingly…
Multi-Instance GPU (MIG) is an important feature of NVIDIA H100, A100, and A30 Tensor Core GPUs, as it can partition a GPU into multiple instances. Each instance has its own compute cores, high-bandwidth memory, L2 cache, DRAM bandwidth, and media engines such as decoders. This enables multiple workloads or multiple users to run workloads simultaneously on one GPU to maximize the GPU…
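As a rough illustration of how software can discover MIG partitions at run time, here is a minimal sketch using NVML (assumed available via nvml.h and linked with -lnvidia-ml). It is not taken from the post; the calls shown are standard NVML MIG queries, with most error handling omitted.

```cpp
// Minimal sketch: check whether MIG is enabled on GPU 0 and count its MIG instances.
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlInit_v2();

    nvmlDevice_t gpu;
    nvmlDeviceGetHandleByIndex_v2(0, &gpu);

    unsigned int current = 0, pending = 0;
    if (nvmlDeviceGetMigMode(gpu, &current, &pending) == NVML_SUCCESS &&
        current == NVML_DEVICE_MIG_ENABLE) {
        unsigned int maxCount = 0, active = 0;
        nvmlDeviceGetMaxMigDeviceCount(gpu, &maxCount);
        for (unsigned int i = 0; i < maxCount; ++i) {
            nvmlDevice_t mig;
            // Each valid handle is an isolated GPU instance with its own memory and SMs.
            if (nvmlDeviceGetMigDeviceHandleByIndex(gpu, i, &mig) == NVML_SUCCESS)
                ++active;
        }
        printf("MIG enabled, %u instance(s) active\n", active);
    } else {
        printf("MIG not enabled on GPU 0\n");
    }

    nvmlShutdown();
    return 0;
}
```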
Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates, bug fixes, content, best practices, and more. We're excited to announce the NVIDIA Quantization-Aware Training (QAT) Toolkit for TensorFlow 2, with the goal of accelerating quantized networks with NVIDIA TensorRT on NVIDIA GPUs. This toolkit provides you with an easy-to-use API to quantize…
High-performance computing (HPC) has become the essential instrument of scientific discovery. Whether it is discovering new, life-saving drugs, battling climate change, or creating accurate simulations of our world, these solutions demand an enormous, and rapidly growing, amount of processing power. They are increasingly out of reach of traditional computing approaches.
Recent work has demonstrated that large transformer models can achieve or advance the SOTA in computer vision tasks such as semantic segmentation and object detection. However, unlike convolutional network models, which can reach that level with only a standard public dataset, transformers typically require a proprietary dataset that is orders of magnitude larger. The recent project VOLO (Vision Outlooker) from SEA AI Lab…
The NVIDIA A30 GPU is built on the latest NVIDIA Ampere architecture to accelerate diverse workloads like AI inference at scale, enterprise training, and HPC applications for mainstream servers in data centers. The A30 PCIe card combines the third-generation Tensor Cores with large HBM2 memory (24 GB) and fast GPU memory bandwidth (933 GB/s) in a low-power envelope (maximum 165 W).
Join NVIDIA experts and Metropolis partners on Sept. 22 for webinars exploring developer SDKs, GPUs, go-to-market opportunities, and more. All three sessions, each with unique speakers and content, will be recorded and will be available for on-demand viewing later. Register Now >> Wednesday, September 22, 2021, 1 PM PDT; Wednesday, September 22, 2021…
Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates, bug fixes, content, best practices, and more. As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo. NVIDIA Triton Inference Server is an open-source AI model serving software that simplifies the deployment of trained AI models at scale in production.
NVIDIA announces the newest release of the CUDA development environment, CUDA 11.4. This release includes GPU-accelerated libraries, debugging and optimization tools, programming language enhancements, and a runtime library to build and deploy your application on GPUs across the major CPU architectures: x86, Arm, and POWER. CUDA 11.4 is focused on enhancing the programming model and…
This post was originally published in August 2019 and has been updated for NVIDIA TensorRT 8.0. Join the NVIDIA Triton and NVIDIA TensorRT community to stay current on the latest product updates, bug fixes, content, best practices, and more. Large-scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought exciting leaps in accuracy for many natural language processing…
Tensor Cores, which are programmable matrix multiply and accumulate units, were first introduced in the V100 GPUs, where they operated on half-precision (16-bit) multiplicands. Tensor Core functionality has been expanded in subsequent architectures, and the Ampere A100 GPUs (compute capability 8.0) added support for additional data types, including double precision.
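To make "programmable" concrete, here is a minimal sketch using the CUDA WMMA API from mma.h (the kernel name and tile layout are illustrative, not from the post): one warp multiplies 16x16 half-precision tiles and accumulates the result in FP32 on Tensor Cores.

```cpp
// Minimal sketch: one warp computes a 16x16x16 half-precision matrix product
// with FP32 accumulation on Tensor Cores via the WMMA API.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half* A, const half* B, float* C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);       // zero the accumulator tile
    wmma::load_matrix_sync(a, A, 16);     // load a 16x16 tile of A (leading dimension 16)
    wmma::load_matrix_sync(b, B, 16);     // load a 16x16 tile of B
    wmma::mma_sync(acc, a, b, acc);       // acc = A * B + acc on Tensor Cores
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
// Launch with a single warp, e.g. wmma_16x16x16<<<1, 32>>>(dA, dB, dC);
```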
Sparse-matrix dense-matrix multiplication (SpMM) is a fundamental linear algebra operation and a building block for more complex algorithms such as finding the solutions of linear systems, computing eigenvalues through the preconditioned conjugate gradient, and multiple right-hand sides Krylov subspace iterative solvers. SpMM is also an important kernel used in many domains such as fluid dynamics…
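For concreteness, a minimal sketch of SpMM through the cuSPARSE generic API, assuming the device arrays are already allocated and populated (helper name and matrix layout are illustrative; error checks omitted): C = alpha*A*B + beta*C with A stored in CSR and B, C dense column-major.

```cpp
// Minimal sketch: SpMM with the cuSPARSE generic API.
#include <cusparse.h>
#include <cuda_runtime.h>

void spmm_csr(cusparseHandle_t handle,
              int m, int k, int n, int nnz,
              int* d_rowOffsets, int* d_cols, float* d_vals,  // CSR arrays of A (m x k)
              float* dB, float* dC)                           // dense B (k x n), C (m x n)
{
    cusparseSpMatDescr_t matA;
    cusparseDnMatDescr_t matB, matC;
    float alpha = 1.0f, beta = 0.0f;

    cusparseCreateCsr(&matA, m, k, nnz, d_rowOffsets, d_cols, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnMat(&matB, k, n, k, dB, CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&matC, m, n, m, dC, CUDA_R_32F, CUSPARSE_ORDER_COL);

    size_t bufSize = 0;
    void* dBuf = NULL;
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, matB, &beta, matC,
                            CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);

    // C = alpha * A * B + beta * C
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, matB, &beta, matC,
                 CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(matA);
    cusparseDestroyDnMat(matB);
    cusparseDestroyDnMat(matC);
}
```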
The NVIDIA Ampere GPU architecture introduced the third generation of Tensor Cores, with the new TensorFloat32 (TF32) mode for accelerating FP32 convolutions and matrix multiplications. TF32 mode is the default option for AI training with 32-bit variables on the Ampere GPU architecture. It brings Tensor Core acceleration to single-precision DL workloads, without needing any changes to model scripts.
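While frameworks opt FP32 training into TF32 by default on Ampere, at the library level the switch is the cuBLAS math mode. The following is a minimal sketch (helper name and shapes are illustrative, not from the post) that runs one single-precision GEMM with TF32 Tensor Core math and then restores the default.

```cpp
// Minimal sketch: opting an FP32 GEMM into TF32 Tensor Core math with cuBLAS.
#include <cublas_v2.h>

void sgemm_tf32(cublasHandle_t handle,
                int m, int n, int k,
                const float* dA, const float* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Allow FP32 GEMMs on this handle to run on Tensor Cores in TF32 mode.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    // C = alpha * A * B + beta * C, column-major, no transposes.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    // Restore the default behavior (inputs treated as full FP32).
    cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);
}
```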
Editor's note: Interested in GPU Operator? Register for our upcoming webinar on January 20th, "How to Easily use GPUs with Kubernetes". Reliably provisioning servers with GPUs can quickly become complex as multiple components must be installed and managed to use GPUs with Kubernetes. The GPU Operator simplifies the initial deployment and management and is based on the Operator Framework.
Recently, NVIDIA unveiled the A100 GPU model, based on the NVIDIA Ampere architecture. Ampere introduced many features, including Multi-Instance GPU (MIG), that play a special role for deep learning-based (DL) applications. MIG makes it possible to use a single A100 GPU as if it were multiple smaller GPUs, maximizing utilization for DL workloads and providing dynamic scalability.
CUDA is the software development platform for building GPU-accelerated applications, providing all the components needed to develop applications targeting every NVIDIA GPU platform for general purpose compute acceleration. The latest CUDA release, CUDA 11.2, is focused on improving the user experience and application performance for CUDA developers. CUDA 11.2…
Exploding model sizes in deep learning and AI, complex simulations in high-performance computing (HPC), and massive datasets in data analytics all continue to demand faster and more advanced GPUs and platforms. At SC20, we announced the NVIDIA A100 80GB GPU, the latest addition to the NVIDIA Ampere family, to help developers, researchers, and scientists tackle their toughest challenges.
Deep neural networks achieve outstanding performance in a variety of fields, such as computer vision, speech recognition, and natural language processing. The computational power needed to process these neural networks is rapidly increasing, so efficient models and computation are crucial. Neural network pruning, removing unnecessary model parameters to yield a sparse network, is a useful way to…
With the third-generation Tensor Core technology, NVIDIA recently unveiled the A100 Tensor Core GPU, which delivers unprecedented acceleration at every scale for AI, data analytics, and high-performance computing. Along with the great performance increase over prior-generation GPUs comes another groundbreaking innovation, Multi-Instance GPU (MIG). With MIG, each A100 GPU can be partitioned up to seven…
The NVIDIA A100 brought the biggest single-generation performance gains ever in our company's history. These speedups are a product of architectural innovations that include Multi-Instance GPU (MIG), support for accelerated structural sparsity, and a new precision called TF32, which is the focus of this post. TF32 is a great precision to use for deep learning training, as it combines the range of…
NVIDIA this week announced that the Italian inter-university consortium CINECA, one of the world's most important supercomputing centers, will use the company's accelerated computing platform to build the world's fastest AI supercomputer. The new "Leonardo" system, built with Atos, is expected to deliver 10 exaflops of FP16 AI performance to enable advanced AI and HPC converged application…
The NVIDIA Ampere architecture provides new mechanisms to control data movement within the GPU and CUDA 11.1 puts those controls into your hands. These mechanisms include asynchronously copying data into shared memory and influencing the residency of data in the L2 cache. This post walks through how to use the asynchronous copy feature, and how to set up your algorithms to overlap…
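A minimal sketch of the asynchronous-copy pattern, assuming the cooperative_groups memcpy_async interface available in CUDA 11 (the kernel name and 256-element tile size are illustrative): a thread block stages a tile of global memory into shared memory, could overlap independent work with the copy, and waits before using the tile.

```cpp
// Minimal sketch: staging a tile into shared memory with an asynchronous copy.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void scale_tile(const float* __restrict__ in, float* __restrict__ out, float s)
{
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Kick off the block-wide copy; on A100 this can use the hardware cp.async path.
    cg::memcpy_async(block, tile, in + blockIdx.x * 256, sizeof(float) * 256);

    // ...independent work could overlap with the in-flight copy here...

    cg::wait(block);  // wait until the staged tile has landed in shared memory

    int i = threadIdx.x;               // launch with blockDim.x == 256
    out[blockIdx.x * 256 + i] = s * tile[i];
}
```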
This fall's GTC will run continuously for five days, October 5-9, across seven time zones. The conference will showcase the latest breakthroughs in HPC and many other GPU technology interest areas. Attend live events in the time zone that works best for you, or browse an extensive catalog of on-demand content showcasing innovative uses of HPC technology. Here's a preview of…
The NVIDIA A100, based on the NVIDIA Ampere GPU architecture, offers a suite of exciting new features: third-generation Tensor Cores, Multi-Instance GPU (MIG), and third-generation NVLink. Ampere Tensor Cores introduce a novel math mode dedicated to AI training: TensorFloat-32 (TF32). TF32 is designed to accelerate the processing of FP32 data types, commonly used in DL workloads.
During the 2020 NVIDIA GPU Technology Conference keynote address, NVIDIA founder and CEO Jensen Huang introduced the new NVIDIA A100 GPU based on the NVIDIA Ampere GPU architecture. In this post, we detail the exciting new features of the A100 that make NVIDIA GPUs an ever-better powerhouse for computer vision workloads. We also showcase two recent CV research projects from NVIDIA Research…
Today, smartphones, the most popular device for taking pictures, can capture images as large as 4K UHD (3840×2160), more than 25 MB of raw data. Even considering the embarrassingly low HD resolution (1280×720), a raw image requires more than 2.5 MB of storage. Storing as few as 100 UHD images would require almost 3 GB of free space. Clearly, if you store data this way…
According to surveys, people capture about 1.2 trillion images every year with phones or digital cameras. The storage of such images, especially in high-resolution raw format, uses lots of memory. JPEG refers to the Joint Photographic Experts Group, which celebrated its 25th birthday in 2017. The JPEG standard specifies the codec, which defines how an image is compressed…
NVIDIA today announced that the first GPU based on the NVIDIA Ampere architecture, the NVIDIA A100, is in full production and shipping to customers worldwide. The A100 draws on design breakthroughs in the NVIDIA Ampere architecture, offering the company's largest leap in performance to date within its eight generations of GPUs, to unify AI training and inference and boost performance by up to…
Recent work has demonstrated that larger language models dramatically advance the state of the art in natural language processing (NLP) applications such as question-answering, dialog systems, summarization, and article completion. However, during training, large models do not fit in the available memory of a single accelerator, requiring model parallelism to split the parameters across multiple…
The NVIDIA mission is to accelerate the work of the da Vincis and Einsteins of our time. Scientists, researchers, and engineers are focused on solving some of the world's most important scientific, industrial, and big data challenges using artificial intelligence (AI) and high performance computing (HPC). The NVIDIA HGX A100 with A100 Tensor Core GPUs delivers the next giant leap in our…
Today, during the 2020 NVIDIA GTC keynote address, NVIDIA founder and CEO Jensen Huang introduced the new NVIDIA A100 GPU based on the new NVIDIA Ampere GPU architecture. This post gives you a look inside the new A100 GPU, and describes important new features of NVIDIA Ampere architecture GPUs. The diversity of compute-intensive applications running in modern cloud data centers has driven…
The new NVIDIA A100 GPU based on the NVIDIA Ampere GPU architecture delivers the greatest generational leap in accelerated computing. The A100 GPU has revolutionary hardware capabilities and we're excited to announce CUDA 11 in conjunction with A100. CUDA 11 enables you to leverage the new hardware capabilities to accelerate HPC, genomics, 5G, rendering, deep learning, data analytics…
The NVIDIA Ampere GPU architecture has arrived! It's time to make sure that your applications are getting the most out of the powerful compute resources in this new architecture. With the release of CUDA 11, we are adding several features to the Nsight family of Developer Tools to help you do just that. These additions improve usability and productivity, and make it easier for you to find bugs…