    Accelerate AI With NVIDIA RTX PCs

    NVIDIA RTX™ PCs accelerate your AI features for maximum performance and lowest latency. NVIDIA offers broad support for all major AI inference backends to meet every developer's needs.


    Overview of AI Inference Backends

    Developers need to weigh several factors when choosing a deployment ecosystem and path for their application. Inference backends map model execution to hardware, and the top options are optimized for NVIDIA RTX GPUs. Each backend offers specific model optimization tools and deployment mechanisms for efficient application integration. Achieving peak AI performance requires model optimization techniques like quantization and pruning, while higher-level interfaces streamline application packaging, installation, and integration.

    Who is it for?

    For developers who want to deploy performant, cross-vendor apps on Windows.

    Inferencing Backends

    ONNX Runtime is a cross-platform machine-learning model accelerator; in conjunction with the DirectML backend on Windows, it provides access to hardware-specific optimizations, as sketched below.

    For AI Models—Get Started With DirectML AI Inferencing
    For Generative AI—Get Started With ONNX Runtime GenAI Inferencing
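
    As a rough illustration, the snippet below runs an ONNX model through ONNX Runtime's DirectML execution provider. It assumes the onnxruntime-directml package is installed; model.onnx is a placeholder for your own model.

        # Minimal sketch: ONNX Runtime inference via the DirectML execution provider.
        # Assumes `pip install onnxruntime-directml` and a local model.onnx (placeholder).
        import numpy as np
        import onnxruntime as ort

        # Prefer DirectML, and fall back to CPU if it is unavailable.
        session = ort.InferenceSession(
            "model.onnx",
            providers=["DmlExecutionProvider", "CPUExecutionProvider"],
        )

        # Build a dummy input matching the model's first input (dynamic dims -> 1).
        input_meta = session.get_inputs()[0]
        shape = [dim if isinstance(dim, int) else 1 for dim in input_meta.shape]
        dummy = np.random.rand(*shape).astype(np.float32)

        outputs = session.run(None, {input_meta.name: dummy})
        print(outputs[0].shape)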

    Model Optimization

    The Olive optimization toolkit offers quantization across CPUs, NPUs, and NVIDIA RTX GPUs, with easy integration into the ONNX Runtime and DirectML inferencing backend (a minimal invocation is sketched below). You can also use TensorRT Model Optimizer to perform quantization for ONNX models.

    Get Started With Olive
    Get Started with TensorRT Model Optimizer
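
    Olive workflows are driven by a configuration file that lists the optimization passes to run. A minimal sketch, assuming the olive-ai package is installed and olive_config.json is a placeholder for a workflow config you have prepared:

        # Minimal sketch: kick off an Olive workflow from Python.
        # Assumes `pip install olive-ai`; olive_config.json is a placeholder config
        # whose passes define the quantization/optimization to apply.
        from olive.workflows import run as olive_run

        # Runs every pass in the config and writes the optimized model artifacts.
        olive_run("olive_config.json")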

    Deployment Mechanisms

    Packaging and deploying ONNX Runtime apps on PCs is simple. DirectML comes pre-installed in Windows. All you need to do is ship your model and, for LLMs, the ONNX Runtime GenAI SDK.

    Get Started With an End-to-End Sample
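
    As a rough sketch of LLM deployment with the ONNX Runtime GenAI SDK (installed via the onnxruntime-genai package), the snippet below generates text from a GenAI-format model folder; model_dir is a placeholder, and the exact API surface can vary between releases.

        # Minimal sketch: text generation with the ONNX Runtime GenAI SDK.
        # Assumes `pip install onnxruntime-genai` and a GenAI-format model folder
        # ("model_dir", placeholder); API details may differ by release.
        import onnxruntime_genai as og

        model = og.Model("model_dir")
        tokenizer = og.Tokenizer(model)

        params = og.GeneratorParams(model)
        params.set_search_options(max_length=128)

        generator = og.Generator(model, params)
        generator.append_tokens(tokenizer.encode("What is DirectML?"))
        while not generator.is_done():
            generator.generate_next_token()

        print(tokenizer.decode(generator.get_sequence(0)))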

    Introduction to ONNX Runtime

    Watch Video (8:12)

    ONNXRuntime-GenAI Installation and Inference Walkthrough

    Watch Video (6:00)

    Who is it for?

    For LLM developers who want wide reach with cross-vendor and cross-OS support.

    Inferencing Backends

    Llama.cpp enables LLM-only inference across a variety of devices and platforms through unified APIs. It requires minimal setup, delivers good performance, and ships as a lightweight package. Llama.cpp is developed and maintained by a large open-source community and supports a wide range of LLMs.

    Get Started With Llama.cpp
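
    A minimal sketch of in-process inference through the community llama-cpp-python bindings; model.gguf is a placeholder for any GGUF checkpoint:

        # Minimal sketch: in-process LLM inference with llama-cpp-python.
        # Assumes `pip install llama-cpp-python` and a local GGUF file (placeholder).
        from llama_cpp import Llama

        # n_gpu_layers=-1 offloads all layers to the GPU when a GPU build is installed.
        llm = Llama(model_path="model.gguf", n_gpu_layers=-1)

        result = llm("Q: What is GGUF? A:", max_tokens=64, stop=["Q:"])
        print(result["choices"][0]["text"])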

    Model Optimization

    Natively, Llama.cpp offers an optimized model format with GGUF. This format allows for optimal model performance and lightweight deployment. It uses quantization techniques to reduce the size and computational requirements of the model to run across a variety of platforms.

    Get Started With Llama.cpp Model Quantization
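
    As a rough sketch, the steps below convert a Hugging Face checkpoint to GGUF and quantize it to 4-bit, driven from Python for consistency with the other examples. It assumes a built llama.cpp checkout; the script and binary names (convert_hf_to_gguf.py, llama-quantize) vary across versions.

        # Minimal sketch: convert and quantize a model for llama.cpp.
        # Assumes a built llama.cpp checkout; paths and tool names are placeholders.
        import subprocess

        # 1) Convert a Hugging Face checkpoint to an FP16 GGUF file.
        subprocess.run(
            ["python", "convert_hf_to_gguf.py", "hf_model_dir",
             "--outfile", "model-f16.gguf"],
            check=True,
        )

        # 2) Quantize the FP16 GGUF down to 4-bit (Q4_K_M) for lighter deployment.
        subprocess.run(
            ["./llama-quantize", "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"],
            check=True,
        )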

    Deployment Mechanisms

    With Llama.cpp, you can deploy in an out-of-process format, with a server running on localhost; apps communicate with this server through a REST API, as sketched after the links below. Popular tools for this include Cortex, Ollama, and LM Studio. For in-process execution, Llama.cpp is linked into the app as a static (.lib) or dynamic (.dll) library.

    Get Started With Ollama
    Get Started With LMStudio

    Get Started With Cortex
    Get Started With In-process Execution
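
    A minimal sketch of the out-of-process pattern, assuming a llama.cpp server (for example, llama-server -m model.gguf) is already listening on localhost:8080 with its OpenAI-style chat endpoint:

        # Minimal sketch: talking to an out-of-process llama.cpp server over REST.
        # Assumes llama-server (or a compatible tool) is running on localhost:8080.
        import requests

        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={
                "messages": [{"role": "user", "content": "Hello!"}],
                "max_tokens": 64,
            },
            timeout=60,
        )
        print(resp.json()["choices"][0]["message"]["content"])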

    Who is it for?

    For developers looking for the latest features and maximum performance on NVIDIA RTX GPUs.

    Inferencing Backends

    NVIDIA® TensorRT™ delivers maximum-performance deep learning inference on NVIDIA RTX GPUs, with GPU-specific TRT engines that extract the last ounce of performance from the GPU.

    Get Started With TensorRT
    Get Started With TensorRT-LLM
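
    As a rough sketch, the snippet below builds a serialized TensorRT engine from an ONNX model using the Python builder API; model.onnx is a placeholder, and exact flags differ between TensorRT major versions.

        # Minimal sketch: building a TensorRT engine from an ONNX model.
        # Assumes `pip install tensorrt`; flags vary between TensorRT versions.
        import tensorrt as trt

        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        network = builder.create_network(
            1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        )

        # Parse the ONNX graph into the TensorRT network definition.
        parser = trt.OnnxParser(network, logger)
        with open("model.onnx", "rb") as f:
            if not parser.parse(f.read()):
                raise RuntimeError(parser.get_error(0))

        config = builder.create_builder_config()
        config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels where beneficial

        # Serialize the GPU-specific engine for later deployment.
        engine_bytes = builder.build_serialized_network(network, config)
        with open("model.engine", "wb") as f:
            f.write(engine_bytes)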

    Optimize Your Models

    To optimize models within the TensorRT ecosystem, developers can use TensorRT-Model Optimizer. This unified library offers state-of-the-art model optimization techniques, such as quantization, pruning, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT to optimize inference speed on NVIDIA GPUs.

    Get Started With TensorRT Model Optimizer
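
    A minimal sketch of post-training INT8 quantization with TensorRT Model Optimizer (the nvidia-modelopt package); the toy model and calibration data are placeholders for your own:

        # Minimal sketch: post-training INT8 quantization with Model Optimizer.
        # Assumes `pip install nvidia-modelopt`; model and data are placeholders.
        import torch
        import modelopt.torch.quantization as mtq

        model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
        calib_data = [torch.randn(8, 16) for _ in range(4)]

        def forward_loop(m):
            # Run representative data through the model to collect quantizer ranges.
            for batch in calib_data:
                m(batch)

        model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)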

    Deployment Mechanisms

    Deploying TensorRT models requires three things: TensorRT, a TensorRT-optimized model, and a TensorRT engine.
    TensorRT engines can be generated ahead of time, or generated within your app using timing caches, as sketched below.

    Get Started With NVIDIA TensorRT Deployment
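
    A minimal sketch of the timing-cache pattern, continuing the builder setup from the earlier build example; timing.cache is a placeholder path:

        # Minimal sketch: reusing a TensorRT timing cache for faster in-app builds.
        import os
        import tensorrt as trt

        logger = trt.Logger(trt.Logger.WARNING)
        builder = trt.Builder(logger)
        config = builder.create_builder_config()

        # Load a previously saved cache if present, otherwise start an empty one.
        blob = open("timing.cache", "rb").read() if os.path.exists("timing.cache") else b""
        cache = config.create_timing_cache(blob)
        config.set_timing_cache(cache, ignore_mismatch=False)

        # ... build_serialized_network(network, config) as in the earlier sketch ...

        # Persist the (possibly updated) cache so future builds skip kernel timing.
        with open("timing.cache", "wb") as f:
            f.write(cache.serialize())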

    Who is it for?

    For developers looking to experiment with and evaluate AI while maintaining cohesion with model training pipelines.

    Inferencing Backends

    PyTorch is a popular open-source machine learning library that offers cross-platform and cross-device inferencing options.

    Get Started With PyTorch
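
    A minimal sketch of device-agnostic PyTorch inference; the toy model stands in for any trained torch.nn.Module:

        # Minimal sketch: cross-device PyTorch inference.
        # Assumes `pip install torch`; the model is a placeholder.
        import torch

        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.Softmax(dim=-1))
        model = model.to(device).eval()

        with torch.inference_mode():  # disable autograd bookkeeping for inference
            out = model(torch.randn(1, 16, device=device))
        print(out)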

    Model Optimization

    PyTorch offers several leading algorithms for model quantization, ranging from quantization-aware training (QAT) to post-training quantization (PTQ), as well as sparsity for in-framework model optimization.

    Get Started With torchao
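
    A minimal sketch of weight-only INT8 post-training quantization with torchao; the torchao API is still evolving, so exact entry points may differ by release:

        # Minimal sketch: in-framework PTQ with torchao's weight-only INT8 path.
        # Assumes `pip install torchao`; entry points may differ by release.
        import torch
        from torchao.quantization import quantize_, int8_weight_only

        model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
        quantize_(model, int8_weight_only())  # swaps Linear weights to INT8 in place

        with torch.inference_mode():
            print(model(torch.randn(1, 64)))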

    Deployment Mechanisms

    To serve models in production applications with PyTorch, developers often deploy in an out-of-process format. This requires building Python packages, generating model files, and standing up a localhost server, which frameworks such as TorchServe and Hugging Face Accelerate streamline, as sketched after the links below.

    Get Started With torchserve
    Get Started With HuggingFace Accelerate
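
    A minimal sketch of the out-of-process pattern with TorchServe, assuming a server is already running on its default inference port with a model registered under the placeholder name my_model:

        # Minimal sketch: calling a model served out-of-process by TorchServe.
        # Assumes `torchserve --start ...` is running on its default inference
        # port (8080); my_model and input.jpg are placeholders.
        import requests

        resp = requests.post(
            "http://localhost:8080/predictions/my_model",
            data=open("input.jpg", "rb").read(),
            timeout=30,
        )
        print(resp.json())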

    Choosing an Inferencing Backend

                                    ONNX Runtime          TensorRT and          Llama.cpp             PyTorch-CUDA
                                    With DirectML         TensorRT-LLM
    Performance                     Faster                Fastest               Fast                  Good
    OS Support                      Windows               Windows and Linux     Windows, Linux,       Windows and Linux
                                                          (TensorRT-LLM is      and Mac
                                                          Linux only)
    Hardware Support                Any GPU or CPU        NVIDIA RTX GPUs       Any GPU or CPU        Any GPU or CPU
    Model Checkpoint Format         ONNX                  TRT                   GGUF or GGML          PyT
    Installation Process            Pre-installed on      Python packages       Python packages       Python packages
                                    Windows               required              required              required
    LLM Support                     Yes                   Yes                   Yes                   Yes
    CNN Support                     Yes                   Yes                   -                     Yes
    Device-Specific Optimizations   Microsoft Olive       TensorRT Model        Llama.cpp             -
                                                          Optimizer
    Python                          Yes                   Yes                   Yes                   Yes
    C/C++                           Yes                   Yes                   Yes                   Yes
    C#/.NET                         Yes                   -                     Yes                   -
    JavaScript                      Yes                   -                     Yes                   -

    Latest NVIDIA News


    Stay up to date on how to power your AI apps with NVIDIA RTX PCs.

    Learn More
