
    NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance

    NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user, or a maximum throughput of over 30,000 tokens per second, on the massive, state-of-the-art 671-billion-parameter DeepSeek-R1 model. These rapid advancements at both ends of the performance spectrum were made possible by improvements to the NVIDIA open ecosystem of inference developer tools, now optimized for the NVIDIA Blackwell architecture.

    These performance records will improve as the NVIDIA platform continues to push the limits of inference on the latest NVIDIA Blackwell Ultra GPUs and NVIDIA Blackwell GPUs.

    Figure 1. NVIDIA B200 GPUs in an NVL8 configuration, running TensorRT-LLM software, deliver the highest published tokens per second per user on the full DeepSeek-R1 671B model

    Single DGX B200 8-GPU system and single DGX H200 8-GPU system | B200 and H200 March and February numbers running on an internal version of TensorRT-LLM | March: 1,024 input tokens, 2,048 output tokens; February and January: 1,024 input tokens, 1,024 output tokens | Concurrency 1 | B200 FP4, H100 and H200 FP8 precision.

    Figure 2. Through a combination of hardware and software, NVIDIA has increased throughput on the DeepSeek-R1 671B model by about 36x since January 2025, translating into about a 32x improvement in cost per token

    Maximum throughput normalized to an 8-GPU system | Single DGX B200 8-GPU system, single DGX H200 8-GPU system, and two DGX H100 8-GPU systems (normalized) | Internal version of TensorRT-LLM | March: 1,024 input tokens, 2,048 output tokens; February and January: 1,024 input tokens, 1,024 output tokens | Concurrency MAX | B200 FP4, H200/H100 FP8 precision.

    The NVIDIA inference ecosystem is the largest in the world. It enables developers to build solutions tailored to their deployment requirements, whether they are targeting maximum user experience or maximum efficiency. It includes open source tools from NVIDIA as well as from the community, all of which take advantage of the latest Blackwell architecture and software advancements.

    These Blackwell advancements include up to 5x more AI compute from fifth-generation Tensor Cores with FP4 acceleration, 2x the NVLink bandwidth of the prior generation through fifth-generation NVLink and NVLink Switch, and scalability to much larger NVLink domains. These performance improvements, both on a per-chip basis and at data center scale, are critical enablers of high-throughput, low-latency inference for cutting-edge LLMs like DeepSeek-R1.

    Accelerated computing requires more than a powerful hardware infrastructure. An optimized and rapidly evolving software stack is also required to deliver optimal performance on today’s most demanding workloads and to be ready to serve new, even more challenging workloads as they emerge. NVIDIA continues to optimize every layer of the technology stack—chips, systems, libraries, algorithms, and more—to deliver exceptional workload performance.

    This post provides an overview of the many updates to the NVIDIA inference ecosystem that get the most out of the NVIDIA Blackwell platform, including NVIDIA TensorRT-LLM, NVIDIA TensorRT, TensorRT Model Optimizer, CUTLASS, NVIDIA cuDNN, and popular AI frameworks such as PyTorch, JAX, and TensorFlow. Additionally, we share new performance and accuracy data measured on an NVIDIA DGX B200 system, featuring eight Blackwell GPUs connected using two NVLink Switch chips.

    The TensorRT ecosystem: A full inference stack optimized for NVIDIA Blackwell

    The NVIDIA TensorRT ecosystem is designed to enable developers to optimize their production inference deployments on NVIDIA GPUs. It contains multiple libraries that enable the preparation, acceleration, and deployment of AI models—all of which are now ready to run on the latest NVIDIA Blackwell architecture, delivering consistently large performance improvements over the prior-generation NVIDIA Hopper architecture.

    TensorRT Model Optimizer is the first step for optimizing inference speed. It offers state-of-the-art model optimization techniques, including quantization, distillation, pruning, sparsity, and speculative decoding, to make models more efficient during inference. The latest TensorRT Model Optimizer 0.25 release supports Blackwell FP4 for both post-training quantization (PTQ) and quantization-aware training (QAT), improving overall inference compute throughput and reducing memory usage for downstream inference frameworks.
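
    As a rough illustration of the PTQ flow, the following minimal sketch uses the Model Optimizer PyTorch quantization API (modelopt.torch.quantization). The toy model, the calibration data, and the NVFP4_DEFAULT_CFG config name are assumptions for illustration; consult the Model Optimizer documentation for the exact FP4 recipe for your model.

```python
import torch
import modelopt.torch.quantization as mtq  # from the nvidia-modelopt package

# Placeholder model and calibration data standing in for a real LLM workload.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda()
calib_batches = [torch.randn(8, 4096, device="cuda") for _ in range(16)]

def forward_loop(m):
    # Model Optimizer calls this loop to collect calibration statistics.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

# Select a quantization recipe; NVFP4_DEFAULT_CFG is assumed here to be the
# Blackwell FP4 config -- check the release notes for the exact name.
quant_cfg = mtq.NVFP4_DEFAULT_CFG

# Apply post-training quantization in place.
model = mtq.quantize(model, quant_cfg, forward_loop)
```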

    After a model is optimized, a high-performance inference framework is essential to run it efficiently. TensorRT-LLM provides developers with a toolbox to enable real-time, cost-efficient, and energy-efficient LLM inference. The latest TensorRT-LLM 0.17 release adds support for Blackwell and provides optimizations tailored to Blackwell instructions, memory hierarchy, and FP4.

    TensorRT-LLM, architected with PyTorch, delivers peak performance through powerful and flexible kernels for common LLM inference operations and advanced runtime features like in-flight batching, KV cache memory management, and speculative decoding.
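
    As a minimal sketch of serving a model through the TensorRT-LLM high-level LLM API, the snippet below generates text for a small batch of prompts; the model identifier is a placeholder, and a pre-quantized checkpoint produced with TensorRT Model Optimizer can be passed the same way.

```python
from tensorrt_llm import LLM, SamplingParams

# Load or build an engine through the high-level LLM API.
# The model ID below is a small placeholder, not the DeepSeek-R1 deployment.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling = SamplingParams(max_tokens=256, temperature=0.7, top_p=0.95)

# In-flight batching and KV cache management are handled by the runtime;
# each request in the batch is scheduled and completed independently.
outputs = llm.generate(
    ["Explain KV cache reuse in one paragraph.",
     "Summarize speculative decoding in two sentences."],
    sampling,
)

for out in outputs:
    print(out.outputs[0].text)
```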

    The popular deep learning frameworks PyTorch, JAX, and TensorFlow have also been updated to support both inference and training on Blackwell. Other LLM-serving frameworks, such as vLLM and Ollama, can now be used on Blackwell GPUs, and additional frameworks will be supported in the near future.

    Blackwell with TensorRT inference performance

    The Blackwell architecture coupled with TensorRT software enables significant inference performance improvements over the Hopper architecture. This increase comes from significantly greater compute performance and memory bandwidth, combined with an optimized software stack, to achieve excellent delivered performance.

    On popular community models, including DeepSeek-R1, Llama 3.1 405B, and Llama 3.3 70B, the DGX B200 platform running TensorRT software and using FP4 precision is already delivering over 3x more inference throughput compared to the DGX H200 platform. 

    Figure 3. Inference throughput (tokens/sec) speedups of Llama 3.3 70B, Llama 3.1 405B, and DeepSeek-R1 using NVIDIA HGX B200 with NVIDIA Blackwell GPUs and FP4, compared to the previous GPU generation running FP8

    Preliminary specifications. May be subject to change. 
    TensorRT Model Optimizer v0.23.0. TensorRT-LLM v0.17.0. Max batch size 2048, actual batch size dynamic using TensorRT-LLM Inflight Batching. H200 FP16/BF16 GEMM + FP8 KV cache. B200 FP4 GEMM + FP8 KV cache. Throughput speedup.
    Llama 3.3 70B: ISL 2048, OSL 128
    Llama 3.1 405B: ISL 2048, OSL 128
    DeepSeek-R1: ISL 1024, OSL 1024

    When quantizing models to make use of the benefits of lower precision compute, ensuring minimal accuracy loss is critical for production deployments. For DeepSeek-R1, the TensorRT Model Optimizer FP4 post-training quantization (PTQ) leads to minimal accuracy loss compared to the FP8 baseline across a variety of datasets, as shown in Table 1.

    Model           | MMLU  | GSM8K | AIME 2024 | GPQA Diamond | MATH-500
    DeepSeek R1-FP8 | 90.8% | 96.3% | 80.0%     | 69.7%        | 95.4%
    DeepSeek R1-FP4 | 90.7% | 96.1% | 80.0%     | 69.2%        | 94.2%
    Table 1. MMLU, GSM8K, AIME 2024, GPQA Diamond, and MATH-500 accuracy for FP8/FP4 DeepSeek-R1 model PTQ

    Accuracy results using the baseline BF16 precision and with FP4 quantization on the popular Llama 3.1 405B and Llama 3.3 70B are provided in Table 2.

    Model               | MMLU  | GSM8K
    Llama 3.1 405B-BF16 | 86.5% | 96.3%
    Llama 3.1 405B-FP4  | 86.1% | 96.1%
    Llama 3.3 70B-BF16  | 82.5% | 95.3%
    Llama 3.3 70B-FP4   | 80.5% | 92.6%
    Table 2. MMLU and GSM8K accuracy for BF16/FP4 Llama model PTQ

    When deploying in low precisions, such as FP4, QAT can be applied to recover accuracy, provided that fine-tuning datasets are available. To illustrate the value of QAT, NVIDIA Nemotron 4 15B and Nemotron 4 340B models quantized with QAT to FP4 using TensorRT Model Optimizer achieve lossless FP4 quantization compared to the BF16 baseline (Table 3).

    Precision       | Nemotron 4 15B Base | Nemotron 4 340B Base
    BF16 (baseline) | 64.2%               | 81.1%
    FP4 with PTQ    | 61.0%               | 80.8%
    FP4 with QAT    | 64.5%               | 81.4%
    Table 3. MMLU 5-shot accuracy for FP4 PTQ and QAT compared to the BF16 baseline. Model Optimizer QAT achieved lossless quantization for Nemotron 4 15B and Nemotron 4 340B
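
    The QAT workflow described above amounts to quantizing first and then fine-tuning the fake-quantized model with an ordinary training loop. The sketch below outlines that flow with the Model Optimizer quantization API under the same assumptions as the earlier PTQ example (toy model, toy data, assumed NVFP4_DEFAULT_CFG name); it is not the exact recipe used for the Nemotron results.

```python
import torch
import modelopt.torch.quantization as mtq

# Toy stand-ins for the real model, calibration loader, and fine-tuning data.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024)
).cuda()
calib = [torch.randn(4, 1024, device="cuda") for _ in range(8)]
train = [(torch.randn(4, 1024, device="cuda"),
          torch.randn(4, 1024, device="cuda")) for _ in range(32)]

def forward_loop(m):
    with torch.no_grad():
        for x in calib:
            m(x)

# Step 1: insert quantizers and calibrate, as in PTQ (FP4 config name assumed).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Step 2: fine-tune the quantized model with a regular training loop (QAT);
# gradients flow through the quantizers so the weights adapt to FP4.
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
for x, y in train:
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```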

    Increasing image generation efficiency on Blackwell using TensorRT and TensorRT Model Optimizer with FP4

    TensorRT and TensorRT Model Optimizer have previously enabled high-performance image generation using diffusion models quantized to 8-bit data formats, including both INT8 and FP8.

    Now, NVIDIA Blackwell and FP4 precision deliver even better performance for AI image generation. These benefits also extend to users looking to generate images locally on AI PCs powered by NVIDIA GeForce RTX 50 series GPUs.

    The Flux.1 model family from Black Forest Labs comprises state-of-the-art text-to-image models that offer superior prompt adherence and the ability to generate complex scenes. Developers can now download FP4 Flux models from Black Forest Labs’ Hugging Face collections and deploy them directly with TensorRT.

    These quantized models were generated by Black Forest Labs using TensorRT Model Optimizer FP4 workflows and recipes. To illustrate the benefit of FP4 image generation on Blackwell, the Flux.1-dev model in FP4 achieves speedups of up to 3x in throughput (images per second) compared to FP16, while reducing VRAM usage by up to 5.2x and maintaining image quality (Table 4).

    Figure 4. GeForce RTX 5090 FP4 image throughput of Flux.1-dev compared to other precisions with 30 steps; FP4 achieves 2.9x greater throughput than FP16 low-VRAM and 1.7x greater throughput than FP8

    Only the transformer backbone in Flux.1-dev was quantized to FP4; the other components remain in BF16. The low-VRAM mode in TensorRT demo/Diffusion loads the T5, CLIP, VAE, and FLUX transformer models used in FLUX.1-dev as needed and unloads them when done. This keeps the peak memory usage of FLUX within the maximum of those four individual models’ sizes, at the cost of extra latency from loading and unloading each model during inference.
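
    The low-VRAM idea generalizes beyond this demo. The following is a minimal, generic PyTorch sketch of the load-run-offload pattern described above, not the actual demo/Diffusion implementation; the stage names in the usage comment (t5, clip, transformer, vae) are hypothetical handles to the corresponding Flux components.

```python
import gc
import torch

def run_stage(module: torch.nn.Module, *inputs):
    """Move one pipeline stage to the GPU, run it, then release its VRAM.

    Peak memory stays near the size of the largest single stage, at the cost
    of extra latency from moving weights on and off the device.
    """
    module.to("cuda")
    with torch.no_grad():
        out = module(*inputs)
    module.to("cpu")          # offload weights back to host memory
    torch.cuda.empty_cache()  # return freed blocks to the CUDA allocator pool
    gc.collect()
    return out

# Hypothetical usage with Flux-style stages:
# text_emb = run_stage(t5, prompt_ids)
# pooled   = run_stage(clip, prompt_ids)
# latents  = run_stage(transformer, noise, text_emb, pooled)
# image    = run_stage(vae, latents)
```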

    Precision       | VRAM usage (GB) | VRAM usage compression
    FP16 (baseline) | 51.4            | 1x
    FP16 low-VRAM   | 23.3            | 2.2x
    FP8             | 26.3            | 1.9x
    FP8 low-VRAM    | 19.9            | 2.6x
    FP4             | 19.5            | 2.6x
    FP4 low-VRAM    | 9.9             | 5.2x
    Table 4. Blackwell FP4 VRAM usage and savings compared with different precision formats

    Figure 5 shows images generated by a Flux model quantized with FP4, highlighting how image quality and content remain consistent with the BF16 baseline for a given prompt. Additionally, Table 5 provides a quantitative evaluation of FP4 image quality, relevance, and appeal using 1,000 images.

    Four images in a grid: the top two show colossal stars and a planet in space, and the bottom two show a sphere of soft, fluffy white fur on a plain background.
    Figure 5. Blackwell FP4 achieves 3x inference performance for the Flux.1-dev model at half the VRAM with comparable quality on RTX 5090

    Input prompt for top image: “Two colossal stars dance in the vastness of space, their intense gravitational forces pulling them closer together. As one star collapses into a black hole, it releases a brilliant burst of energy, creating a dazzling display of shimmering light against the cosmic backdrop. Swirling clouds of gas and dust encircle the spectacle, hinting at the unimaginable power contained within.”

    Input prompt for bottom image: “A realistic sphere textured with the soft, fluffy fur of an animal sits in the center of the image on a plain colored background. The fur ripples with gentle, lifelike motion, and the shadows cast by the fur create an engaging visual effect. The render has a high-quality, octane appearance.”

    Precision | Image Reward | CLIP-IQA | CLIPScore
    BF16      | 1.118        | 0.927    | 30.15
    FP4 PTQ   | 1.096        | 0.923    | 29.86
    FP4 QAT   | 1.119        | 0.928    | 29.92
    Table 5. Quantitative evaluation of FP4 image quality using Image Reward, CLIP-IQA, and CLIPScore (higher is better)

    Flux.1-dev model, 30 steps, 1,000 images. TensorRT Model Optimizer v0.23.0 FP4 recipe. Simulated on NVIDIA H100 GPUs on 1/24/2025. The simulation is mathematically identical at the TensorRT kernel level to RTX 5090; actual scores may differ slightly on an RTX 5090.
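
    For reference, CLIPScore-style prompt-image alignment can be computed with off-the-shelf tooling. The sketch below uses the torchmetrics implementation as an illustration; the CLIP checkpoint and the dummy inputs are assumptions, and this is not the evaluation pipeline used for Table 5, so absolute scores will differ.

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

# CLIPScore measures image-text alignment with a CLIP model; higher is better.
# The checkpoint is an assumption -- any CLIP variant torchmetrics supports works.
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Dummy uint8 images (N, C, H, W) standing in for generated Flux outputs.
images = torch.randint(0, 255, (2, 3, 512, 512), dtype=torch.uint8)
prompts = [
    "Two colossal stars dance in the vastness of space.",
    "A realistic sphere textured with soft, fluffy fur.",
]

score = metric(images, prompts)
print(f"CLIPScore: {score.item():.2f}")
```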

    The TensorRT 10.8 update can now run the Flux.1-dev and Flux.1-schnell models with peak FP4 performance on higher-end GeForce RTX 50 series GPUs. With --low-vram mode, you can even run these models on a limited-memory configuration such as the GeForce RTX 5070. Moreover, TensorRT also supports running the Depth and Canny Flux ControlNets provided by Black Forest Labs. You can try it now using TensorRT demo/Diffusion.

    Blackwell-optimized deep learning primitives with cuDNN

    Since its introduction in 2014, NVIDIA cuDNN has been fundamental to accelerating deep learning workloads on GPUs. By providing highly optimized implementations of core deep learning primitives, it has enabled frameworks like PyTorch, TensorFlow, and JAX to deliver state-of-the-art performance. Through seamless integration with these frameworks and optimized performance across different GPU architectures, cuDNN has established itself as the performance engine powering end-to-end deep learning workloads from training to inference. 

    With the release of cuDNN 9.7, we’re extending support to the NVIDIA Blackwell architecture across both data center and GeForce product lines. Developers can expect significant performance improvements when migrating cuDNN operations to the latest Blackwell Tensor Cores. The library provides optimized general matrix multiply (GEMM) APIs that harness Blackwell’s advanced capabilities for block-scaled FP8 and FP4 operations, abstracting away the complexity of low-level optimization so developers can focus on innovation.

    cuDNN delivers substantial performance improvements for FP8 Flash Attention operations, achieving up to 50% speedups on forward propagation and 84% speedups on backward propagation with FP8 kernels. The library also provides highly optimized GEMM operations with advanced fusion capabilities on Blackwell architecture. Looking ahead, cuDNN will continue to expand its fusion support to enable even greater performance gains for deep learning workloads.
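
    In practice, most developers reach these cuDNN attention kernels through a framework rather than direct cuDNN calls. As a hedged example of that path, recent PyTorch releases expose a cuDNN backend for scaled dot-product attention; backend availability and data-type support depend on the installed PyTorch and cuDNN versions, and this sketch uses BF16 rather than the FP8 kernels measured above.

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes: (batch, heads, sequence, head_dim); head_dim=128 matches Figure 6.
q, k, v = (torch.randn(1, 16, 4096, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

# Ask PyTorch to dispatch scaled_dot_product_attention to the cuDNN fused
# attention kernels (raises if the backend is unavailable in this build).
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([1, 16, 4096, 128])
```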

    Figure 6. FP8 Flash Attention speedup of NVIDIA Blackwell B200 over NVIDIA Hopper H100 (head dimension 128, causal mask off, batch size 1), across sequence lengths from 4K to 64K and 8 and 16 attention heads

    Crafting high-performance Blackwell kernels with CUTLASS

    Since its 2017 debut, CUTLASS has been instrumental for researchers and developers implementing high-performance CUDA kernels on NVIDIA GPUs. By providing developers with comprehensive tools to design custom operations, such as GEMMs and convolutions, targeting NVIDIA Tensor Cores, it has been critical for the development of hardware-aware algorithms, powering breakthroughs like FlashAttention and establishing itself as a cornerstone of GPU-accelerated computing.

    With the release of CUTLASS 3.8, we’re extending support to the NVIDIA Blackwell architecture, enabling developers to harness next-generation Tensor Cores with support for all the new data types. This includes the new narrow-precision MX formats and NVIDIA’s own FP4, empowering developers to optimize custom algorithms and production workloads with the latest innovations in accelerated computing. Figure 7 shows that CUTLASS achieves up to 98% of relative peak performance for Tensor Core operations.

    Figure 7. Relative peak performance achievable across different NVIDIA Blackwell Tensor Core data types (FP16, BF16, TF32, INT8, FP8, and FP4) for CUTLASS GEMMs

    Benchmarks were performed on a B200 system. M=K=16384 and N=17290.

    CUTLASS brings popular features like Grouped GEMM and Mixed Input GEMM operations to Blackwell. Grouped GEMM helps accelerate MoE models by offering a more efficient way to perform multiple expert computations in parallel. Mixed Input GEMMs power quantized kernels that can reduce GPU memory requirements for LLMs where model weights dominate GPU memory consumption.
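
    To make the grouped GEMM pattern concrete, the sketch below shows the per-expert matrix multiplications an MoE layer performs, written as a plain PyTorch loop; a grouped GEMM kernel executes these independent, variable-sized GEMMs in a single launch instead of a loop. The shapes and routing here are illustrative only, not a CUTLASS API example.

```python
import torch

num_experts, d_model, d_ff, num_tokens = 4, 1024, 4096, 512

# One weight matrix per expert and a router assignment per token (illustrative).
expert_weights = [torch.randn(d_model, d_ff, device="cuda")
                  for _ in range(num_experts)]
tokens = torch.randn(num_tokens, d_model, device="cuda")
assignment = torch.randint(0, num_experts, (num_tokens,), device="cuda")

# Reference computation: loop over experts, each running its own GEMM on the
# tokens routed to it. A grouped GEMM kernel fuses these E differently sized
# GEMMs into one kernel launch, which is what accelerates MoE inference.
outputs = torch.empty(num_tokens, d_ff, device="cuda")
for e in range(num_experts):
    idx = (assignment == e).nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        outputs[idx] = tokens[idx] @ expert_weights[e]
```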

    OpenAI Triton support for Blackwell

    The OpenAI Triton compiler now also supports Blackwell, enabling developers and researchers to use the latest Blackwell architecture features from a Python-based compiler. OpenAI Triton can take advantage of the latest architectural innovations in Blackwell and achieves near-optimal performance on several critical use cases. To learn more, see OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability, co-authored by NVIDIA and OpenAI.
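
    As a small illustration of the Python-based programming model, the following is a minimal Triton elementwise kernel. It is generic Triton code rather than anything Blackwell-specific; the compiler handles the architecture-specific code generation.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Usage:
# c = add(torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda"))
```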

    Summary

    The NVIDIA Blackwell architecture incorporates many breakthrough capabilities that help accelerate generative AI inference, including a second-generation Transformer Engine with FP4 Tensor Cores and fifth-generation NVLink with NVLink Switch. NVIDIA announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025: a single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user, or a maximum throughput of over 30,000 tokens per second, on the massive, state-of-the-art 671-billion-parameter DeepSeek-R1 model.

    A rich suite of libraries, now optimized for NVIDIA Blackwell, will enable developers to achieve significant increases in inference performance for both today’s AI models and tomorrow’s evolving landscape. Learn more about the NVIDIA AI Inference platform and stay informed about the latest AI inference performance updates. 

    Acknowledgements 

    This work would not have been possible without the exceptional contributions of many, including Matthew Nicely, Nick Comly, Gunjan Mehta, Rajeev Rao, Dave Michael, Yiheng Zhang, Brian Nguyen, Asfiya Baig, Akhil Goel, Paulius Micikevicius, June Yang, Alex Settle, Kai Xu, Zhiyu Cheng, and Chenjie Luo.
