
    Autoscaling NVIDIA Riva Deployment with Kubernetes for Speech AI in Production


    Speech AI applications, from call centers to virtual assistants, rely heavily on automatic speech recognition (ASR) and text-to-speech (TTS). ASR processes an audio signal and transcribes it to text. Speech synthesis, or TTS, generates high-quality, natural-sounding audio from text in real time. The challenge of speech AI is to achieve high accuracy while meeting the latency requirements for real-time interaction.

    NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications, achieving high accuracy while delivering high throughput using state-of-the-art models. Riva provides world-class speech recognition and text-to-speech skills for interacting with humans across multiple languages. Riva can be deployed on premises, in the cloud, or on edge and embedded platforms, and you can scale Riva servers up to handle hundreds or thousands of real-time streams with low latency. This post walks you through, step by step, how to deploy Riva servers at scale with Kubernetes for autoscaling and Traefik for load balancing.

    Riva can be customized for your dataset across different applications, such as chatbots, virtual assistants, and transcription services. The pretrained models can be fine-tuned with the NVIDIA TAO Toolkit using transfer learning, making development much faster than pretraining models from scratch. The pretrained models for speech recognition and speech synthesis are provided in the NVIDIA GPU Cloud (NGC) catalog.

    The non-English (Spanish, German, and Russian) ASR models are trained on a variety of open-source datasets, such as Mozilla Common Voice, as well as private datasets. Riva automatic speech recognition models are developed to provide out-of-the-box accuracy and serve as a great starting point for adapting to a specific industry, jargon, dialect, or even noisy surroundings.

    For users to enjoy lifelike dialogue, speech applications must offer human-like expression. Using the FastPitch model, Riva helps developers customize the text-to-speech pipeline and create expressive, human-like voices. For example, at inference time, developers can vary voice pitch and speed using SSML tags. State-of-the-art models such as FastPitch help Riva text-to-speech pipelines run several times faster than other competing options in the market.

    Hardware and software prerequisites

    To scale your Riva deployment to handle hundreds to thousands of real-time streams with low latency, you can make use of Kubernetes. Preparing a Kubernetes cluster for Riva deployment involves meeting several prerequisites.

    First, ensure that your Kubernetes nodes have GPUs available. Running Riva’s deep neural networks on the GPU is critical for ensuring the low latency and high bandwidth requirements for real-time streaming. Riva is supported on NVIDIA Volta or later GPUs, and it is recommended that each GPU have a minimum of 16 GB of VRAM. Riva can also be deployed on public cloud compute instances with the appropriate GPU resources, such as AWS EC2 instances. 

    Next, enable access to the GPUs from workloads running in your Kubernetes cluster. The easiest way to do this is to use the NVIDIA GPU Operator, which will automatically set up and manage the NVIDIA software components on the worker nodes. 
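    As a rough sketch, the GPU Operator can be installed with Helm from the NVIDIA NGC Helm repository; the release name and namespace below are illustrative, so check the GPU Operator documentation for the options that fit your environment:

    $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    $ helm repo update
    $ helm install --wait gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace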

    If you prefer, you can also install and manage the GPU software components yourself. In this case, you need an NVIDIA GPU Driver and the NVIDIA Container Toolkit on each of your nodes, as well as the NVIDIA device plugin for Kubernetes. This enables Kubernetes to discover which nodes have GPUs and make them available to containers running on those nodes. You can also optionally install the GPU Feature Discovery plugin, which will generate labels for each node describing the GPU features available. 
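    Whichever route you choose, it is worth confirming that Kubernetes actually sees the GPUs before deploying Riva. A quick check (the node name is a placeholder) is to inspect a node and look for nvidia.com/gpu in both its Capacity and Allocatable sections:

    $ kubectl describe node <gpu-node-name> | grep -i "nvidia.com/gpu"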

    The publicly available version of NVIDIA GPU Operator is free of charge and includes community support only. NVIDIA AI Enterprise is recommended for enterprise customers. With an active subscription, you can use a fully validated and enterprise-supported version of GPU Operator. For more information, see the NVIDIA AI Enterprise documentation.  

    To support autoscaling, Prometheus must be installed in your Kubernetes cluster to collect metrics and report them to your control plane. Also ensure that the Kubernetes API Aggregation Layer is enabled in your cluster; it allows extension APIs, such as the custom metrics API served by the Prometheus adapter and backed by Prometheus, to be registered with the Kubernetes API server so the autoscaler can query them.

    To use the Helm chart in the next section, install Helm using the following script:

    $ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
    $ chmod 700 get_helm.sh
    $ ./get_helm.sh
    

    Then use the kube-prometheus-stack Helm chart to install the Prometheus monitoring stack:

    $ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    $ helm repo update
    $ helm install <name> prometheus-community/kube-prometheus-stack \
        --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
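    Before moving on, you can confirm that the monitoring stack is running and reachable. The Prometheus service name is generated from the release name you chose (the Prometheus adapter configuration later in this post assumes a release named example-metrics), so substitute the name reported by kubectl get svc:

    $ kubectl get pods
    $ kubectl get svc | grep prometheus
    $ kubectl port-forward svc/<prometheus-service-name> 9090:9090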
    

    Autoscale Riva deployment 

    The four main steps to deploying Riva for speech AI inference with autoscaling and load balancing are:

    1. Create Kubernetes deployment for Riva servers
    2. Create Kubernetes service to expose Riva servers as a network service
    3. Autoscale Riva deployment:
      • Expose NVIDIA Triton metrics to Prometheus using kube-prometheus and ServiceMonitor
      • Deploy Prometheus Adapter for the custom metric 
      • Create the Horizontal Pod Autoscaler (HPA)
    4. Use Traefik to distribute client requests among all the Pods

    Helm chart for Riva deployment

    You can use a Helm chart for this deployment, as it is easy to modify and deploy across different environments. A chart is organized as a collection of files inside a directory, here named riva, and Helm expects a structure that matches the files below:

    Chart.yaml: This is where you put all the information about the chart you are packaging, such as the version number and the dependencies.

    values.yaml: This is where you define all the values you want to inject into your templates directory. 

    charts: This is the directory where you store other charts that the Riva chart depends on. The Riva chart depends on the Traefik and Prometheus adapter charts for ingress and autoscaling, respectively.

    templates: This is the directory where you put the actual manifest you are deploying with the chart. 

    For the Riva server, you need deployment.yaml, service.yaml, hpa.yaml, and ingressroute.yaml, all in the templates directory. These all get configuration values from values.yaml. Conventionally, Helm charts also include helper templates in a partials file, usually called _helpers.tpl.

    You can change the configuration of your deployment by modifying values.yaml, which supplies the values used by the templates in templates/*.yaml.

    riva/
         Chart.yaml
         values.yaml
         charts/
         templates/
                deployment.yaml
                service.yaml
                hpa.yaml
                ingressroute.yaml
                _helpers.tpl
    

    The Chart.yaml file below specifies all the information about the chart, such as the Riva version and the repositories of the Traefik Helm chart and the Prometheus community Kubernetes Helm charts.

    apiVersion: v2
    appVersion: 1.9.0-beta
    description: Riva Speech Services
    name: riva
    version: 1.9.0-beta

    dependencies:
      - name: traefik
        version: "~10.6.2"
        repository: "https://helm.traefik.io/traefik"
        tags:
          - loadBalancing
      - name: prometheus-adapter
        version: "~3.0.0"
        repository: "https://prometheus-community.github.io/helm-charts"
        tags:
          - autoscaling
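    Because Traefik and the Prometheus adapter are declared as chart dependencies, fetch the subcharts into the charts/ directory before installing, and make sure the tags they are gated on are enabled in the top-level values.yaml. A minimal sketch, using the tag names from Chart.yaml above:

    $ helm dependency update riva/

    Then, in values.yaml:

    tags:
      loadBalancing: true
      autoscaling: true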
    

    Create Kubernetes deployment for Riva servers

    A Kubernetes deployment provides declarative updates for Pods and ReplicaSets. The deployment.yaml below creates a set of replicated Kubernetes Pods for Riva servers. You can specify how many Kubernetes Pods you want to start your deployment with, indicated by the .spec.replicas field. The .spec.containers field tells each Kubernetes Pod to run a Riva server Docker container on a GPU device.

    The .spec.selector field defines how the deployment finds which Pods to manage. Port 50051 is specified for Riva servers to receive inference requests from clients over gRPC, while port 8002 is for NVIDIA Triton metrics. The fields in deployment.yaml can be modified in values.yaml based on your deployment.

    kind: Deployment
    metadata:
      […]
    spec:
      replicas: {{ .Values.autoscaling.minReplicas }}
      selector:
        […]
      template:
        metadata:
          labels:
            app: {{ template "riva-server.name" . }}
        spec:
          volumes:
            […]
          containers:
            - name: riva-speech-api
              image: {{ $server_image }}
              imagePullPolicy: {{ .Values.riva.pullPolicy }}
              resources:
                limits:
                  nvidia.com/gpu: {{ .Values.riva.numGpus }}
              command: ["/opt/riva/bin/start-riva"]
              args:
                {{- range $service, $enabled := .Values.riva.speechServices }}
                - "--{{$service}}_service={{$enabled}}"
                {{- end }}
              env:
                - name: TRTIS_MODEL_STORE
                  value: "/data/models"
              ports:
                - containerPort: 50051
                  name: speech-grpc
                - containerPort: 8000
                  name: http-triton
                - containerPort: 8001
                  name: grpc-triton
                - containerPort: 8002
                  name: metrics-triton
              volumeMounts:
                - mountPath: /data/
                  name: workdir
    ……
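    The template above pulls its settings from values.yaml. The excerpt below is an illustrative sketch of what those fields might look like; the image name and tag, GPU count, and enabled services are assumptions and should be adjusted to the Riva version and skills you deploy:

    riva:
      image: nvcr.io/nvidia/riva/riva-speech:1.9.0-beta-server
      pullPolicy: IfNotPresent
      numGpus: 1
      speechServices:
        asr: true
        nlp: true
        tts: true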
    

    Create Kubernetes service for Riva servers

    A Kubernetes service is an abstract way to expose an application running on a set of Pods as a network service. The service.yaml below exposes Riva servers as a network service, so Riva servers are ready to receive inference requests from clients. Because NVIDIA Triton Inference Server is an intrinsic component of the Riva server, NVIDIA Triton metrics can be collected from port 8002, just as with a standalone NVIDIA Triton Inference Server deployment. The fields in service.yaml can be modified in values.yaml based on your deployment.

    Setting clusterIP: None makes this a headless service: Kubernetes does not allocate a cluster IP or proxy traffic, and each Riva Pod is exposed individually. This lets Prometheus scrape every Pod and lets Traefik, described below, balance requests directly across the Pods. (If you instead want Kubernetes to provision an external load balancer or a specific IP address, see the Kubernetes service documentation.)

    apiVersion: v1
    kind: Service
    metadata:
      name: {{ template "riva-server.fullname" . }}
      labels:
        app: {{ template "riva-server.name" . }}
    spec:
      selector:
        app: {{ template "riva-server.name" . }}
      clusterIP: None
      ports:
        - port: 50051
          targetPort: speech-grpc
          name: riva-speech
        - port: 8000
          targetPort: http-triton
          name: triton-http
        - port: 8001
          targetPort: grpc-triton
          name: triton-grpc
        - port: 8002
          targetPort: metrics-triton
          name: metrics
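    Because the service is headless, a quick sanity check is to confirm that its endpoints resolve to the individual Riva Pod IPs; the service name is generated by the riva-server.fullname template, so substitute the name reported by kubectl get svc:

    $ kubectl get svc
    $ kubectl get endpoints <riva-service-name>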
    

    Autoscale Riva deployment

    Figure 1. An example Kubernetes deployment with three replicated Pods, each running a Riva server on a GPU

    Expose NVIDIA Triton metrics to Prometheus using ServiceMonitor and kube-prometheus

    To automatically change the number of Riva server Pods, you need a ServiceMonitor to monitor the Riva service for target discovery by Prometheus. You also need kube-prometheus to deploy Prometheus and link it to the metric endpoints. In the YAML file below, the ServiceMonitor is set to monitor the Riva service every 15 seconds.

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: {{ template "riva-server-metrics-monitor.fullname" . }}
      namespace: {{ .Release.Namespace }}
      labels:
        app: {{ template "riva-server-metrics-monitor.name" . }}
        chart: {{ template "riva-server.chart" . }}
        release: {{ .Release.Name }}
    spec:
      selector:
        matchLabels:
          app: {{ template "riva-server.name" . }}
      endpoints:
        - port: metrics
          interval: 15s
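    After the chart is installed, you can confirm that Prometheus has discovered the Riva metrics endpoints; the Prometheus service name below is a placeholder that depends on your kube-prometheus-stack release name:

    $ kubectl get servicemonitors
    $ kubectl port-forward svc/<prometheus-service-name> 9090:9090

    Then open http://localhost:9090/targets and check that one target per Riva Pod is listed and reported as UP.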

    You can use Prometheus to scrape the NVIDIA Triton metrics from all the Kubernetes Pods on port 8002. You can then define a new custom metric based on the NVIDIA Triton metrics using PromQL (Prometheus Query Language) and tell the Prometheus adapter how to collect it, as shown below:

    prometheus-adapter:
      prometheus:
        url: http://example-metrics-kube-prome-prometheus.default.svc.cluster.local
        port: 9090
      rules:
        custom:
          - seriesQuery: 'nv_inference_queue_duration_us{namespace="default",pod!=""}'
            resources:
              overrides:
                namespace:
                  resource: "namespace"
                pod:
                  resource: "pod"
            name:
              matches: "nv_inference_queue_duration_us"
              as: "avg_time_queue_ms"
            metricsQuery: 'avg(delta(nv_inference_queue_duration_us{<<.LabelMatchers>>}[30s])/(1+delta(nv_inference_request_success{<<.LabelMatchers>>}[30s]))/1000) by (<<.GroupBy>>)'
    

    Two NVIDIA Triton metrics define the custom metric avg_time_queue_ms, which is the average queue time per inference request over the past 30 seconds; HPA decides whether to change the replica count based on this value.

    • nv_inference_request_success is the cumulative count of successful inference requests; delta(nv_inference_request_success[30s]) gives the number of requests completed in the past 30 seconds.
    • nv_inference_queue_duration_us is the cumulative inference queuing duration in microseconds.

    Now that you have a running copy of Prometheus monitoring your application, you need to deploy the Prometheus adapter, which knows how to communicate with both Kubernetes and Prometheus, acting as a translator between the two. Because prometheus-adapter is listed as a dependency in Chart.yaml, it is deployed automatically along with the Riva chart.

    The Prometheus adapter helps you leverage the metrics collected by Prometheus to make scaling decisions. These metrics are exposed through an API service and can be readily consumed by HPA, which uses the custom metric to automatically scale the number of Kubernetes Pod replicas up or down based on the volume of inference requests. If the value of the custom metric is higher than the desired value, HPA increases the number of replicas to reduce the queue time.
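    Once the adapter is running, you can verify that the custom metric is actually exposed through the Kubernetes custom metrics API, which is what HPA queries (jq is optional and only pretty-prints the output):

    $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
    $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/avg_time_queue_ms" | jq .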

    In hpa.yaml and values.yaml, you need to specify the maximum and minimum numbers of replicas for the deployment. In values.yaml, you can set metrics.type to Pods to take the average of the given metric across all the Pods controlled by the autoscaling target. The target (desired) metric value can be set to any number, based on your response-time requirements.

    The choice of metric for the HPA depends on the service-level requirements of your organization. For example, you might prefer to sacrifice response time in favor of overall device utilization to minimize the cost of the deployment. In this example, the response time to inference requests is the primary metric for the HPA.

    hpa.yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: riva-hpa
      namespace: {{ .Release.Namespace }}
      labels:
        app: {{ template "riva-server.name" . }}
        chart: {{ template "riva-server.chart" . }}
        ……
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: {{ template "riva-server.fullname" . }}
      minReplicas: {{ .Values.autoscaling.minReplicas }}
      maxReplicas: {{ .Values.autoscaling.maxReplicas }}
      metrics: {{ toYaml .Values.autoscaling.metrics | nindent 2 }}
    
    
    values.yaml
    autoscaling:
      minReplicas: 1
      maxReplicas: 3
      metrics:
        - type: Pods
          pods:
            metric:
              name: avg_time_queue_ms
            target:
              type: AverageValue
              averageValue: 500m
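    After installing the chart, you can check that the HPA found its metric and is tracking the Riva deployment; the current value of avg_time_queue_ms appears alongside the target in the output:

    $ kubectl get hpa
    $ kubectl describe hpa riva-hpa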
    

    Load balance of Riva deployment with Traefik

    You also need a load balancer to spread the inference requests evenly among all the Riva servers, based on the load. Generally speaking, there are two types of load balancers: Layer 4 and Layer 7. Layer 4 load balancers reside at the transport layer of the network stack and are agnostic of the application protocol (such as HTTP/2, which gRPC uses). Layer 7 load balancers reside at the application layer and usually proxy requests for a specific service to its endpoints.

    This example uses the Traefik ingress controller and load balancer, which operates at Layer 7, to receive the inference requests and distribute them across all the Riva servers. Register the IngressRoute kind in the Kubernetes cluster and use the h2c protocol for gRPC in Traefik, as shown in ingressroute.yaml below.

    apiVersion: traefik.containo.us/v1alpha1
    kind: IngressRoute
    metadata:
      name: {{ template "riva-server-ingressroute.name" . }}
      namespace: {{ .Release.Namespace }}
      labels:
        app: {{ template "riva-server.name" . }}
        chart: {{ template "riva-server.chart" . }}
        release: {{ .Release.Name }}
        heritage: {{ .Release.Service }}
    spec:
      entryPoints:
        - riva-grpc
      routes:
        - match: PathPrefix(`/`)
          kind: Rule
          services:
            - name: {{ template "riva-server.fullname" . }}
              port: 50051
              scheme: h2c
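    The riva-grpc entry point referenced above is not one of Traefik's defaults, so it must be declared in the Traefik subchart values. The excerpt below is a sketch for the Traefik 10.x chart; the port numbers are assumptions and should match the Riva gRPC port:

    traefik:
      ports:
        riva-grpc:
          port: 50051
          expose: true
          exposedPort: 50051
          protocol: TCP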

    Try Riva deployment 

    Finally, you can put everything together, enabling Riva clients to send inference requests to Riva servers. For Riva servers, everything can be installed using a Helm chart:

    $ helm install <name> .
    $ kubectl get svc
    NAME           TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                        AGE
    riva-traefik   LoadBalancer   X.X.X.X      <pending>     50051:32067/TCP,80:30082/TCP   69s
    

    You can test the Riva client by running the Riva streaming ASR client provided in the NGC Riva Client image. Have the client send streaming inference requests to the Riva servers, so that HPA autoscales the number of Kubernetes Pods up or down as the load changes.

    $ riva_streaming_asr_client --riva_uri=<IP address of load balancer>:50051 --audio_file /work/wav/en-US_sample.wav --num_parallel_requests=32 --num_iterations=1024
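    While the client streams requests, you can watch the autoscaler react in a second terminal. As the average queue time crosses the target, new Riva Pods are created (up to maxReplicas) and are removed again once the load subsides:

    $ kubectl get hpa riva-hpa --watch
    $ kubectl get pods --watch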

    You can also query the NVIDIA Triton metrics from Prometheus and visualize the results as a time series in the Prometheus web UI (for example, at localhost:9090/graph after port-forwarding the Prometheus service).

    Figure 2. Prometheus graphic user interface showing the custom metric as a time series based on the user's query

    Conclusion

    This post provides the step-by-step instructions and code to autoscale an NVIDIA Riva deployment in a Kubernetes environment. It also shows you how to balance the workload using Traefik. View all the steps and results.

    To learn more, watch the Autoscaling Riva Deployment with Kubernetes for Conversational AI in Production GTC 2022 session. For more information about deploying AI inference workloads on a large scale with Kubernetes and NGINX Plus (as another load balancer option), see Deploying NVIDIA Triton at Scale with MIG and Kubernetes.

