NVIDIA DGX Cloud Serverless Inference is an auto-scaling AI inference solution that enables application deployment with speed and reliability. Powered by NVIDIA Cloud Functions (NVCF), DGX Cloud Serverless Inference abstracts multi-cluster infrastructure setups across multi-cloud and on-premises environments for GPU-accelerated workloads. Whether managing AI workloads, high-performance computing (HPC), AI simulations, or containerized applications, the platform enables you to scale globally while abstracting the underlying infrastructure. Deploy once and scale everywhere.
Independent Software Vendors (ISVs) often face challenges when deploying and scaling AI applications. These applications need to be deployed worldwide or closer to where the customer infrastructure resides. This may require deploying across multiple clouds, data centers, and geographies, leading to complex infrastructure operations. Serverless AI inference addresses this challenge by abstracting the underlying infrastructure across clouds, data centers, and clusters, giving ISVs a simple, easy-to-use, and consistent method to deploy AI applications.
NVIDIA DGX Cloud Serverless Inference acts as a horizontal aggregator for compute infrastructure. ISVs can seamlessly mix resources from NVIDIA, NVIDIA Cloud Partners, their private cloud from a cloud service provider (CSP), or their on-prem capacity. Whether you want to expand capacity temporarily or test new cloud providers, the solution provides unmatched flexibility.
This post explains how NVIDIA DGX Cloud Serverless Inference enables developers to scale AI seamlessly across cloud environments, providing global load balancing, autoscaling, and multicloud flexibility for AI, graphical, and job workloads through a single API endpoint.

Key benefits for ISVs
DGX Cloud Serverless Inference is designed to let developers and ISVs focus on what they do best: building applications. NVCF simplifies the delivery and scaling of those applications without requiring teams to manage GPUs or operate infrastructure. Key benefits include:
- Reduced infrastructure and operational burden: Deploy applications closer to your customer infrastructure, regardless of the cloud provider, with a single, unified, autoscaling service.
- Agility for business growth: Add compute capacity quickly to support burst or short-term workloads.
- Easy transition options: Integrate existing compute setups into the platform using bring your own (BYO) compute capabilities.
- Risk-free exploration: Trial new geographies, providers, or GPU types before committing to long-term investments. Support use cases like data sovereignty requirements, low latency requirements, and lower costs.
Which workloads can DGX Cloud Serverless Inference run?
DGX Cloud Serverless Inference supports a wide variety of containerized workloads, including AI, graphical, and job workloads (Figure 2). Many such workloads already run on DGX Cloud Serverless Inference today, including NVIDIA AI workloads on build.nvidia.com and simulation workloads such as NVIDIA Omniverse.

AI workloads
DGX Cloud Serverless Inference handles cutting-edge large language models (LLMs), including models too large to fit on a single node that therefore require multinode inference. It also excels at a variety of other workload types, including:
- Object detection
- Image, 3D, and video generation
- Advanced machine learning models
Graphical workloads
NVIDIA has its roots in graphics computing, making the platform uniquely suited to graphics-intensive tasks, including:
- Digital twins and simulations
- Interactive streaming services
- Digital human and robotics workflows
With compute optimized specifically for graphical workloads, DGX Cloud Serverless Inference integrates seamlessly with technologies like Omniverse or NVIDIA Aerial, enhancing performance when it matters most.
Job workloads
DGX Cloud Serverless Inference is ideal for workloads that need batch processes and run to completion. Whether it’s rendering tasks or fine-tuning AI models, the platform handles “run-to-completion” workloads, ensuring compute resources are used efficiently. Use cases include:
- NVIDIA TensorRT engine optimization
- Physical AI model training through video data generation
How to get started using DGX Cloud Serverless Inference
There are multiple ways to bring a workload to DGX Cloud Serverless Inference. As shown in Figure 1, the fastest and easiest way to get started is to use NVIDIA NIM microservices containers and NVIDIA Blueprints on build.nvidia.com. DGX Cloud Serverless Inference includes elastic NIM capabilities directly within the UI, making it easy to scale these optimized models.
Alternatively, ISVs can use a custom container and allow DGX Cloud Serverless Inference to handle the autoscaling and global load balancing across various compute targets. ISVs can also use Helm charts for more complex deployments.
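As a rough sketch of the custom container path, the following Python example registers a container image from the NGC Registry as a function through the NVCF REST API. The endpoint path, field names, and response shape are assumptions based on the publicly documented NVCF API pattern, and the function name and image are placeholders; consult the API documentation for the exact schema.

```python
import os
import requests

# Sketch only: register a custom container as a function via the NVCF REST API.
# The base URL, path, and field names below are assumptions; verify them against
# the API documentation.
NGC_API_KEY = os.environ["NGC_API_KEY"]          # NGC personal or service API key
API_BASE = "https://api.ngc.nvidia.com/v2/nvcf"  # assumed control-plane base URL

payload = {
    "name": "my-custom-inference",                       # hypothetical function name
    "containerImage": "nvcr.io/yourorg/your-image:1.0",  # hypothetical image pushed to the NGC Registry
    "inferenceUrl": "/v1/infer",                         # path your container serves requests on
    "inferencePort": 8000,                               # port your container listens on
    "healthUri": "/health",                              # health check path exposed by your container
}

resp = requests.post(
    f"{API_BASE}/functions",
    headers={"Authorization": f"Bearer {NGC_API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
# The response includes the new function ID and version ID (exact shape per the API docs).
print(resp.json())
```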
Once the workload is deployed to DGX Cloud Serverless Inference, the ISV application can call on the model through an API endpoint. DGX Cloud Serverless Inference abstracts the compute clusters behind this API endpoint. This API endpoint routes invocation requests to a gateway and global request queue, which can leverage multiple regional queues for optimal load balancing. The ISV can mix and match multiple clusters behind the API endpoint.
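As an illustration of calling that endpoint, the sketch below sends a request and uses HTTP polling for long-running work. The invocation and status URLs and the NVCF-REQID header are assumptions based on the publicly documented NVCF pattern, and the payload shape is defined by whatever the deployed container expects; verify the details against the API documentation.

```python
import os
import time
import requests

# Sketch only: invoke a deployed function through the single API endpoint and
# poll until the result is ready. Paths and headers are assumptions.
NGC_API_KEY = os.environ["NGC_API_KEY"]
INVOKE_BASE = "https://api.nvcf.nvidia.com/v2/nvcf/pexec"  # assumed invocation base URL
FUNCTION_ID = "<your-function-id>"  # returned when the function was created

headers = {"Authorization": f"Bearer {NGC_API_KEY}"}
resp = requests.post(
    f"{INVOKE_BASE}/functions/{FUNCTION_ID}",
    headers=headers,
    json={"input": "example payload"},  # payload shape is defined by your container
    timeout=60,
)

# A 202 response indicates the request is still queued or running; poll for the result.
while resp.status_code == 202:
    request_id = resp.headers["NVCF-REQID"]  # assumed header carrying the request ID
    time.sleep(1)
    resp = requests.get(f"{INVOKE_BASE}/status/{request_id}", headers=headers, timeout=60)

resp.raise_for_status()
print(resp.json())
```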
For example, Figure 3 shows a scenario in which an ISV uses two clusters from two different providers. One provider could be compute from an NVIDIA Cloud Partner, or instances in the ISV's private cloud at a CSP or on on-prem servers. The NVIDIA Cluster Agent (NVCA) software is installed in this cluster, making its compute visible and available for servicing the workload. The other provider in Figure 3 could be reserved or on-demand compute provided through DGX Cloud. The ISV can use any combination of cluster setups, depending on unique business needs.

Clusters can also be tagged with attributes to help target deployments. For instance, a deployment can be targeted only to clusters within a specific geography, that have been graphics optimized, that have caching support, or that meet a specific certification (for example, SOC2, HIPAA). This gives the ISV more control over where workloads can be run. For additional details, see the Function Deployment documentation.
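As a hypothetical illustration of attribute-based targeting, the sketch below deploys a function version with a deployment specification that requests a GPU type, sets instance limits, and restricts placement to clusters carrying certain tags. The endpoint path and the field names, including the attributes list and backend value, are assumptions rather than the documented schema; the Function Deployment documentation defines the actual fields.

```python
import os
import requests

# Sketch only: deploy a function version with a deployment specification that
# targets tagged clusters. Every field name here is an assumption; consult the
# Function Deployment documentation for the real schema.
NGC_API_KEY = os.environ["NGC_API_KEY"]
API_BASE = "https://api.ngc.nvidia.com/v2/nvcf"  # assumed control-plane base URL
FUNCTION_ID = "<your-function-id>"
VERSION_ID = "<your-function-version-id>"

deployment = {
    "deploymentSpecifications": [
        {
            "backend": "<cluster-group-name>",   # hypothetical cluster group or backend identifier
            "gpu": "L40S",                       # requested GPU type
            "minInstances": 1,                   # 0 would allow scale-to-zero when idle
            "maxInstances": 4,
            "attributes": ["us-east", "soc2"],   # hypothetical cluster tags to target
        }
    ]
}

resp = requests.post(
    f"{API_BASE}/deployments/functions/{FUNCTION_ID}/versions/{VERSION_ID}",
    headers={"Authorization": f"Bearer {NGC_API_KEY}"},
    json=deployment,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

Under this assumed schema, listing multiple specifications would be one way to mix clusters from different providers behind the same function.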
Finally, the DGX Cloud Serverless Inference API is unopinionated beyond the URL and authorization header, with the payload shaped to the workload's needs, providing maximum flexibility in how it is used. For instance, in the case of an LLM, the payload can be tailored to match the OpenAI chat completions API format. The ISV developer also has the flexibility of using HTTP polling, HTTP streaming, or gRPC. For more information, see the API documentation.
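For example, an LLM invocation with an OpenAI-style chat completions payload might look like the following sketch. It reuses the assumed invocation route from the earlier example; the function ID and model name are placeholders, and the response shape depends on the deployed container. A streaming variant would set stream to true and read the response incrementally.

```python
import os
import requests

# Sketch only: LLM-style invocation with an OpenAI chat-completions-shaped
# payload. The route is an assumption; the payload is whatever your container expects.
NGC_API_KEY = os.environ["NGC_API_KEY"]
FUNCTION_ID = "<your-llm-function-id>"  # hypothetical LLM function

payload = {
    "model": "your-model-name",  # hypothetical model identifier served by the container
    "messages": [
        {"role": "user", "content": "Summarize serverless inference in one sentence."}
    ],
    "max_tokens": 128,
    "stream": False,  # set True for HTTP streaming if the container supports it
}

resp = requests.post(
    f"https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/{FUNCTION_ID}",
    headers={"Authorization": f"Bearer {NGC_API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
# Assumes an OpenAI-compatible response from the container.
print(resp.json()["choices"][0]["message"]["content"])
```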
How are functions deployed?
NVIDIA Cloud Functions (NVCF) is the control plane layer of DGX Cloud Serverless Inference. The NVIDIA Cluster Agent is installed on the worker nodes and communicates with the control plane to register the cluster. NVCF enables seamless deployment and scaling of AI inference workloads through a streamlined, serverless approach. The deployment process follows these key steps (Figure 4):
- Push artifacts to NGC Registry: AI developers or service providers push required assets—such as containers, AI models, Helm charts, and other resources—to the NVIDIA NGC Registry. This serves as the central repository for managing inference-ready artifacts.
- Create the function: Using the AI Partner Service, users define a function that specifies how the AI model or service should execute. This step abstracts the complexities of infrastructure management.
- Deploy the function: Once created, the function is deployed across available compute resources. NVCF intelligently manages deployment, ensuring efficient execution across multiple GPUs.
- Worker nodes are deployed and scaled: NVCF dynamically provisions worker nodes based on demand, auto-scaling the infrastructure across NVIDIA DGX Cloud or partner compute environments (see the readiness-polling sketch after this list).
- Fetch containers and models: The worker nodes retrieve the necessary containers and models from the NGC Registry, ensuring the latest versions are executed.
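To make the scaling step concrete, a minimal readiness check might poll the function version until it reports an active status, as in the sketch below. The endpoint path, response shape, and status values are assumptions based on the public NVCF API pattern; check the API documentation for the authoritative values.

```python
import os
import time
import requests

# Sketch only: wait for a deployed function version to become ready.
# Path, response shape, and status values are assumptions.
NGC_API_KEY = os.environ["NGC_API_KEY"]
API_BASE = "https://api.ngc.nvidia.com/v2/nvcf"  # assumed control-plane base URL
FUNCTION_ID = "<your-function-id>"
VERSION_ID = "<your-function-version-id>"

headers = {"Authorization": f"Bearer {NGC_API_KEY}"}
status = None
while status not in ("ACTIVE", "ERROR"):  # assumed terminal status values
    resp = requests.get(
        f"{API_BASE}/functions/{FUNCTION_ID}/versions/{VERSION_ID}",
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    status = resp.json().get("function", {}).get("status")
    print(f"Function status: {status}")
    if status not in ("ACTIVE", "ERROR"):
        time.sleep(10)

# Once active, worker nodes have pulled the container and model from the NGC
# Registry and the function can be invoked through the single API endpoint.
```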

This process enables ISVs and developers to focus on AI innovation rather than infrastructure management, benefiting from autoscaling, high availability, and cost-efficient GPU utilization.
How are ISVs using DGX Cloud Serverless Inference?
DGX Cloud Serverless Inference is already driving innovation worldwide, powering interactive AI experiences, large-scale simulation environments, and more. The following ISVs have been leveraging the technology as part of the NVIDIA preview program:
- Aible: An AI-powered data science solutions provider that automates data engineering and machine learning workflows to deliver measurable business impact for enterprises. Aible demonstrated how NVIDIA Cloud Functions serverless GPUs improve generative AI TCO by 200x for end-to-end retrieval-augmented generation (RAG) solutions.
- Bria: A visual generative AI platform for developers, with models trained on 100% licensed data. Bria uses NVIDIA Cloud Functions to scale its inference needs for text-to-image generation, consuming NVIDIA L40S and NVIDIA H100 GPUs on demand to lower its overall TCO.
- Cuebric: A generative AI tool that enables filmmakers and creatives to quickly transform concepts into photorealistic, film-ready virtual environments and backgrounds in minutes. Cuebric uses NVIDIA Cloud Functions to burst on demand and scale its AI workloads globally.
- Outerbounds: An infrastructure and tooling provider for ML, AI, and data scientists, Outerbounds provides its customers with on-demand, scalable GPU infrastructure on top of NVIDIA Cloud Functions. To reduce cost, Outerbounds uses the scale-to-zero instances and fast cold-boot start-up capabilities of NVIDIA Cloud Functions.
Whether running advanced perceptual systems, high-fidelity simulations, or dynamic AI workloads, DGX Cloud Serverless Inference ensures optimal performance and resource allocation.
Get started with DGX Cloud Serverless Inference
ISVs and NVIDIA Cloud Partners can now try DGX Cloud Serverless Inference. For ISVs, it can be a low-risk way to onboard different compute providers, including the ISV’s private cloud or NVIDIA Cloud Partners, with DGX Cloud Serverless Inference serving as a “translation layer” between them.
For NVIDIA Cloud Partners, onboarding as a DGX Cloud partner can enable easier adoption by ISVs and a more seamless ISV transition from their private cloud or DGX Cloud compute to compute provided by leading NVIDIA Cloud Partners.
To learn more, visit DGX Cloud Serverless Inference, where you can sign up to start a 30-day evaluation.