    Accelerate AI With NVIDIA RTX PCs

    NVIDIA RTX™ PCs accelerate your AI features for maximum performance and lowest latency. NVIDIA offers broad support for all major AI inference backends to meet every developer's needs.


    Overview of AI Inference Backends

    Developers need to weigh several factors when choosing a deployment ecosystem and path for their application. Inference backends map model execution to hardware, and the top options are optimized for NVIDIA RTX GPUs. Each backend offers specific model optimization tools and deployment mechanisms for efficient application integration. Achieving peak AI performance requires model optimization techniques like quantization and pruning, while higher-level interfaces streamline application packaging, installation, and integration.

    Who is it for?

    For developers who want to deploy performant, cross-vendor apps on Windows.

    Inferencing Backends

    ONNX Runtime is a cross-platform machine-learning model accelerator; in conjunction with the DirectML backend on Windows, it provides access to hardware-specific optimizations, as sketched below.

    For AI Models—Get Started With DirectML AI Inferencing
    For Generative AI—Get Started With ONNX Runtime GenAI Inferencing
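
    As a rough illustration, the snippet below runs an ONNX model through ONNX Runtime's DirectML execution provider. It assumes the onnxruntime-directml package is installed; model.onnx is a placeholder for your own model.

        # Minimal sketch: ONNX Runtime inference via the DirectML execution provider.
        # Assumes `pip install onnxruntime-directml` and a local model.onnx (placeholder).
        import numpy as np
        import onnxruntime as ort

        # Prefer DirectML, and fall back to CPU if it is unavailable.
        session = ort.InferenceSession(
            "model.onnx",
            providers=["DmlExecutionProvider", "CPUExecutionProvider"],
        )

        # Build a dummy input matching the model's first input (dynamic dims -> 1).
        input_meta = session.get_inputs()[0]
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        dummy = np.random.rand(*shape).astype(np.float32)

        outputs = session.run(None, {input_meta.name: dummy})
        print(outputs[0].shape)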

    Model Optimization

    The Olive optimization toolkit offers quantization across CPUs, NPUs, and NVIDIA RTX GPUs, with easy integration into the ONNX Runtime and DirectML inferencing backend (a minimal invocation is sketched below). You can also use TensorRT Model Optimizer to perform quantization for ONNX models.

    Get Started With Olive
    Get Started with TensorRT Model Optimizer
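
    Olive workflows are driven by a configuration file that lists the optimization passes to run. A minimal sketch, assuming the olive-ai package is installed and olive_config.json is a placeholder for a workflow config you have prepared:

        # Minimal sketch: kick off an Olive workflow from Python.
        # Assumes `pip install olive-ai`; olive_config.json is a placeholder config
        # whose passes define the quantization/optimization to apply.
        from olive.workflows import run as olive_run

        # Runs every pass in the config and writes the optimized model artifacts.
        olive_run("olive_config.json")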

    Deployment Mechanisms

    Packaging and deploying ONNX Runtime apps on PCs is simple. DirectML comes pre-installed in Windows. All you need to do is ship your model and, for LLMs, the ONNX Runtime GenAI SDK.

    Get Started With an End-to-End Sample
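
    As a rough sketch of LLM deployment with the ONNX Runtime GenAI SDK (installed via the onnxruntime-genai package), the snippet below generates text from a GenAI-format model folder; model_dir is a placeholder, and the exact API surface can vary between releases.

        # Minimal sketch: text generation with the ONNX Runtime GenAI SDK.
        # Assumes `pip install onnxruntime-genai` and a GenAI-format model folder
        # ("model_dir", placeholder); API details may differ by release.
        import onnxruntime_genai as og

        model = og.Model("model_dir")
        tokenizer = og.Tokenizer(model)

        params = og.GeneratorParams(model)
        params.set_search_options(max_length=128)

        generator = og.Generator(model, params)
        generator.append_tokens(tokenizer.encode("What is DirectML?"))
        while not generator.is_done():
            generator.generate_next_token()

        print(tokenizer.decode(generator.get_sequence(0)))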

    Introduction to ONNX Runtime

    Watch Video (8:12)

    ONNXRuntime-GenAI Installation and Inference Walkthrough

    Watch Video (6:00)

    Who is it for?

    For LLM developers who want wide reach with cross-vendor and cross-OS support.

    Inferencing Backends

    Llama.cpp enables LLM-only inference across a variety of devices and platforms through unified APIs. It requires minimal setup, delivers good performance, and ships as a lightweight package. Llama.cpp is developed and maintained by a large open-source community and supports a wide range of LLMs.

    Get Started With Llama.cpp
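
    A minimal sketch of in-process inference through the community llama-cpp-python bindings; model.gguf is a placeholder for any GGUF checkpoint:

        # Minimal sketch: in-process LLM inference with llama-cpp-python.
        # Assumes `pip install llama-cpp-python` and a local GGUF file (placeholder).
        from llama_cpp import Llama

        # n_gpu_layers=-1 offloads all layers to the GPU when a GPU build is installed.
        llm = Llama(model_path="model.gguf", n_gpu_layers=-1)

        result = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
        print(result["choices"][0]["text"])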

    Model Optimization

    Natively, Llama.cpp offers an optimized model format with GGUF. This format allows for optimal model performance and lightweight deployment. It uses quantization techniques to reduce the size and computational requirements of the model to run across a variety of platforms.

    Get Started With Llama.cpp Model Quantization
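
    As a rough sketch, the steps below convert a Hugging Face checkpoint to GGUF and quantize it to 4-bit, driven from Python for consistency with the other examples. It assumes a built llama.cpp checkout; the script and binary names (convert_hf_to_gguf.py, llama-quantize) vary across versions.

        # Minimal sketch: convert and quantize a model for llama.cpp.
        # Assumes a built llama.cpp checkout; paths and tool names are placeholders.
        import subprocess

        # 1) Convert a Hugging Face checkpoint to an FP16 GGUF file.
        subprocess.run(
            ["python", "convert_hf_to_gguf.py", "hf_model_dir",
             "--outfile", "model-f16.gguf"],
            check=True,
        )

        # 2) Quantize the FP16 GGUF down to 4-bit (Q4_K_M) for lighter deployment.
        subprocess.run(
            ["./llama-quantize", "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"],
            check=True,
        )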

    Deployment Mechanisms

    With Llama.cpp, you can deploy in an out-of-process format, with a server running on localhost; apps communicate with this server through a REST API, as sketched after the links below. Popular tools for this include Cortex, Ollama, and LM Studio. For in-process execution, Llama.cpp is linked into the app as a static (.lib) or dynamic (.dll) library.

    Get Started With Ollama
    Get Started With LMStudio

    Get Started With Cortex
    Get Started With In-process Execution
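
    A minimal sketch of the out-of-process pattern, assuming a llama.cpp server (for example, llama-server -m model.gguf) is already listening on localhost:8080 with its OpenAI-style chat endpoint:

        # Minimal sketch: talking to an out-of-process llama.cpp server over REST.
        # Assumes llama-server (or a compatible tool) is running on localhost:8080.
        import requests

        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "messages": [{"role": "user", "content": "Hello!"}],
                "max_tokens": 64,
            },
            timeout=60,
        )
        print(resp.json()["choices"][0]["message"]["content"])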

    Who is it for?

    For developers looking for the latest features and maximum performance on NVIDIA RTX GPUs.

    Inferencing Backends

    NVIDIA® TensorRT™ delivers maximum-performance deep learning inference on NVIDIA RTX GPUs, with GPU-specific TRT engines that extract the last ounce of performance from the GPU.

    Get Started With TensorRT
    Get Started With TensorRT-LLM
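
    As a rough sketch, the snippet below builds a serialized TensorRT engine from an ONNX model using the Python builder API; model.onnx is a placeholder, and exact flags differ between TensorRT major versions.

        # Minimal sketch: building a TensorRT engine from an ONNX model.
        # Assumes `pip install tensorrt`; flags vary between TensorRT versions.
        import tensorrt as trt

        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )

        # Parse the ONNX graph into the TensorRT network definition.
        parser = trt.OnnxParser(network, logger)
        with open("model.onnx", "rb") as f:
            if not parser.parse(f.read()):
                raise RuntimeError(parser.get_error(0))

        config = builder.create_builder_config()
        config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where beneficial

        # Serialize the GPU-specific engine for later deployment.
        engine_bytes = builder.build_serialized_network(network, config)
        with open("model.engine", "wb") as f:
            f.write(engine_bytes)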

    Optimize Your Models

    To optimize models within the TensorRT ecosystem, developers can use TensorRT-Model Optimizer. This unified library offers state-of-the-art model optimization techniques, such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT to optimize inference speed on NVIDIA GPUs.

    Get Started With TensorRT Model Optimizer
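
    A minimal sketch of post-training INT8 quantization with TensorRT Model Optimizer (the nvidia-modelopt package); the toy model and calibration data are placeholders for your own:

        # Minimal sketch: post-training INT8 quantization with Model Optimizer.
        # Assumes `pip install nvidia-modelopt`; model and data are placeholders.
        import torch
        import modelopt.torch.quantization as mtq

        model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
        calib_data = [torch.randn(8, 16) for _ in range(4)]

        def forward_loop(m):
            # Run representative data through the model to collect quantizer ranges.
            for batch in calib_data:
                m(batch)

        model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)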

    Deployment Mechanisms

    Deploying TensorRT models requires three things: TensorRT, a TensorRT-optimized model, and a TensorRT engine.
    TensorRT engines can be generated ahead of time, or generated within your app using timing caches, as sketched below.

    Get Started With NVIDIA TensorRT Deployment
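
    A minimal sketch of the timing-cache pattern, continuing the builder setup from the earlier build example; timing.cache is a placeholder path:

        # Minimal sketch: reusing a TensorRT timing cache for faster in-app builds.
        import os
        import tensorrt as trt

        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        config = builder.create_builder_config()

        # Load a previously saved cache if present, otherwise start an empty one.
        blob = open("timing.cache", "rb").read() if os.path.exists("timing.cache") else b""
        cache = config.create_timing_cache(blob)
        config.set_timing_cache(cache, ignore_mismatch=False)

        # ... build_serialized_network(network, config) as in the earlier sketch ...

        # Persist the (possibly updated) cache so future builds skip kernel timing.
        with open("timing.cache", "wb") as f:
            f.write(cache.serialize())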

    Who is it for?

    For developers looking to experiment with and evaluate AI while maintaining cohesion with model training pipelines.

    Inferencing Backends

    PyTorch is a popular open-source machine learning library that offers cross-platform and cross-device inferencing options.

    Get Started With PyTorch
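
    A minimal sketch of device-agnostic PyTorch inference; the toy model stands in for any trained torch.nn.Module:

        # Minimal sketch: cross-device PyTorch inference.
        # Assumes `pip install torch`; the model is a placeholder.
        import torch

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.Softmax(dim=-1))
        model = model.to(device).eval()

        with torch.inference_mode():  # disable autograd bookkeeping for inference
            out = model(torch.randn(1, 16, device=device))
        print(out)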

    Model Optimization

    PyTorch offers several leading algorithms for model quantization, ranging from quantization-aware training (QAT) to post-training quantization (PTQ), as well as sparsity for in-framework model optimization.

    Get Started With torchao
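
    A minimal sketch of weight-only INT8 post-training quantization with torchao; the torchao API is still evolving, so exact entry points may differ by release:

        # Minimal sketch: in-framework PTQ with torchao's weight-only INT8 path.
        # Assumes `pip install torchao`; entry points may differ by release.
        import torch
        from torchao.quantization import quantize_, int8_weight_only

        model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
        quantize_(model, int8_weight_only())  # swaps Linear weights to INT8 in place

        with torch.inference_mode():
            print(model(torch.randn(1, 64)))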

    Deployment Mechanisms

    To serve models in production applications with PyTorch, developers often deploy in an out-of-process format. This requires building Python packages, generating model files, and standing up a localhost server, which frameworks such as TorchServe and Hugging Face Accelerate streamline, as sketched after the links below.

    Get Started With torchserve
    Get Started With HuggingFace Accelerate
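
    A minimal sketch of the out-of-process pattern with TorchServe, assuming a server is already running on its default inference port with a model registered under the placeholder name my_model:

        # Minimal sketch: calling a model served out-of-process by TorchServe.
        # Assumes `torchserve --start ...` is running on its default inference
        # port (8080); my_model and input.jpg are placeholders.
        import requests

        resp = requests.post(
            "http://localhost:8080/predictions/my_model",
            data=open("input.jpg", "rb").read(),
            timeout=30,
        )
        print(resp.json())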

    Choosing an Inferencing Backend

                                    ONNX Runtime          TensorRT and          Llama.cpp             PyTorch-CUDA
                                    With DirectML         TensorRT-LLM
    Performance                     Faster                Fastest               Fast                  Good
    OS Support                      Windows               Windows and Linux     Windows, Linux,       Windows and Linux
                                                          (TensorRT-LLM is      and Mac
                                                          Linux only)
    Hardware Support                Any GPU or CPU        NVIDIA RTX GPUs       Any GPU or CPU        Any GPU or CPU
    Model Checkpoint Format         ONNX                  TRT                   GGUF or GGML          PyT
    Installation Process            Pre-installed on      Python packages       Python packages       Python packages
                                    Windows               required              required              required
    LLM Support                     Yes                   Yes                   Yes                   Yes
    CNN Support                     Yes                   Yes                   -                     Yes
    Device-Specific Optimizations   Microsoft Olive       TensorRT Model        Llama.cpp             -
                                                          Optimizer
    Python                          Yes                   Yes                   Yes                   Yes
    C/C++                           Yes                   Yes                   Yes                   Yes
    C#/.NET                         Yes                   -                     Yes                   -
    JavaScript                      Yes                   -                     Yes                   -

    Latest NVIDIA News


    Stay up to date on how to power your AI apps with NVIDIA RTX PCs.

    Learn More
