
    Horizontal Autoscaling of NVIDIA NIM Microservices on Kubernetes

    As of 3/18/25, NVIDIA Triton Inference Server is now NVIDIA Dynamo.

    NVIDIA NIM microservices are model inference containers that can be deployed on Kubernetes. In a production environment, it’s important to understand the compute and memory profile of these microservices to set up a successful autoscaling plan. 

    In this post, we describe how to set up and use the Kubernetes Horizontal Pod Autoscaler (HPA) with an NVIDIA NIM for LLMs model to automatically scale microservices up and down based on custom metrics.

    Prerequisites

    To follow along with this tutorial, you need the following prerequisites:

    • An NVIDIA AI Enterprise license
      • NVIDIA NIM for LLMs is available for self-hosting under the NVIDIA AI Enterprise license. Deploying NIM for LLMs in your cluster requires generating an NGC API key so that the Kubernetes cluster can download the container image (see the example secret commands after this list)
    • A Kubernetes cluster version 1.29 or later (we used DGX Cloud Clusters)
    • Admin access to the Kubernetes cluster
    • Kubernetes CLI tool kubectl installed
    • Helm CLI installed
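
    To make the NGC API key available to the cluster, one common pattern is to store it in two Kubernetes secrets: an image pull secret for pulling NIM containers from nvcr.io and a generic secret that the container can use to download model artifacts from NGC. The following is a minimal sketch; the secret names (ngc-secret, ngc-api), the key name, and the namespace are placeholder assumptions, so check the documentation of your chosen deployment method for the exact names it expects.

    export NGC_API_KEY=<your-ngc-api-key>

    # Image pull secret so Kubernetes can pull NIM container images from nvcr.io
    kubectl create secret docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password=$NGC_API_KEY \
      -n <namespace>

    # Generic secret (placeholder name) that the NIM pod can use to download model weights
    kubectl create secret generic ngc-api \
      --from-literal=NGC_API_KEY=$NGC_API_KEY \
      -n <namespace>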

    Setting up a Kubernetes cluster

    The first step in this tutorial is to set up your Kubernetes cluster with the components needed to scrape metrics and make them available to the Kubernetes HPA controller. This requires the following components:

    • Kubernetes Metrics Server
    • Prometheus
    • Prometheus Adapter 
    • Grafana

    Kubernetes Metrics Server

    Metrics Server scrapes resource metrics from kubelets and exposes them in the Kubernetes API server through the Metrics API. These metrics are used by both the Horizontal Pod Autoscaler and the kubectl top command.

    To install the Kubernetes Metrics Server, use Helm:

    helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
    helm upgrade --install metrics-server metrics-server/metrics-server
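
    After the chart is installed, you can confirm that the Metrics API is serving data. On clusters where kubelets use self-signed certificates, the chart may also need the --kubelet-insecure-tls argument; whether that applies to your cluster is an environment-specific assumption.

    # Confirm that resource metrics are being served
    kubectl top nodes
    kubectl top pods -A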

    Prometheus and Grafana

    Prometheus and Grafana are well-known tools for scraping metrics from pods and building dashboards. To install Prometheus and Grafana, use the kube-prometheus-stack Helm chart, which includes many different components (Prometheus, Grafana, the Prometheus Operator, and more).

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack
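
    To verify the stack and reach the Grafana UI, check the pods in the release namespace and port-forward the Grafana service. The names below assume a release called prometheus installed in the prometheus namespace (matching the adapter commands later in this post); adjust them to your [RELEASE_NAME].

    # Check that the Prometheus, Grafana, and operator pods are running
    kubectl get pods -n prometheus

    # Forward the Grafana UI to localhost:3000 (the service is named <release>-grafana)
    kubectl -n prometheus port-forward svc/prometheus-grafana 3000:80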

    Prometheus Adapter

    The Prometheus adapter exposes the metrics scraped by Prometheus in the Kubernetes API server through the custom metrics API. This enables HPA to use custom metrics from pods in its scaling decisions.

    To install the Prometheus adapter in the same namespace as Prometheus and Grafana, use the following Helm commands:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install <name> prometheus-community/prometheus-adapter -n <namespace>

    Make sure that the Prometheus adapter is pointing to the correct Prometheus service endpoint. In this case, I had to edit the deployment and correct the URL:

    kubectl edit deployment prom-adapter-prometheus-adapter -n prometheus
    spec:
          affinity: {}
          containers:
          - args:
            - /adapter
            - --secure-port=6443
            - --cert-dir=/tmp/cert
            - --prometheus-url=http://prometheus-prometheus.prometheus.svc:9090
            - --metrics-relist-interval=1m
            - --v=4
            - --config=/etc/adapter/config.yaml
            image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.12.0
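
    The adapter exposes only the Prometheus series that match the rules in its config.yaml. The chart's default rules may already pick up gpu_cache_usage_perc, but if they don't, a rule like the following is a minimal sketch of how to expose it through the custom metrics API, assuming the series carries namespace and pod labels; it is set through the chart's rules.custom value in a values file.

    # adapter-values.yaml (sketch)
    rules:
      custom:
      - seriesQuery: 'gpu_cache_usage_perc{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          matches: "gpu_cache_usage_perc"
          as: ""
        metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

    helm upgrade <name> prometheus-community/prometheus-adapter -n <namespace> -f adapter-values.yaml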

    If everything is set up right, you should see metrics from Prometheus using the following command:

    kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/*/gpu_cache_usage_perc?selector=app%3Dmeta-llama3-8b"
     
    {"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{},"items":[{"describedObject":{"kind":"Pod","namespace":"<namespace>","name":"meta-llama3-70b-5db5f7dd89-tvcwl","apiVersion":"/v1"},"metricName":"gpu_cache_usage_perc","timestamp":"2025-01-02T20:13:15Z","value":"1m","selector":null},{"describedObject":{"kind":"Pod","namespace":"<namespace>","name":"meta-llama3-8b-5c6ddbbfb5-dp2mv","apiVersion":"/v1"},"metricName":"gpu_cache_usage_perc","timestamp":"2025-01-02T20:13:15Z","value":"14m","selector":null}]}

    Deploying a NIM microservice

    In this tutorial, you use NIM for LLMs as the microservice to scale, specifically the meta/llama-3.1-8b-instruct model. There are multiple options for deploying a NIM microservice, such as the NIM for LLMs Helm chart or the NVIDIA NIM Operator.

    After deployment, note the service name and namespace of your NIM for LLMs microservice, because they are used in many of the following commands.
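
    For example, you can list the service and deployment to confirm the names; the meta-llama3-8b name used throughout this post is just the name of my deployment, so substitute your own.

    # Note the service and deployment names of the NIM microservice
    kubectl get svc,deploy -n <namespace>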

    NIM for LLMs already exposes a Prometheus endpoint with many interesting metrics. To see the endpoint, use the following commands:

    kubectl -n <namespace> port-forward svc/<service-name> 8080

    From a browser, go to localhost:8080/metrics and look for the specific metric named gpu_cache_usage_perc. In this post, you use this metric as a basis for autoscaling. This metric shows the percent utilization of the KV cache and is reported by the vLLM stack.
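
    With the port-forward from the previous command still running, you can also inspect the metric from the command line; this simply filters the Prometheus text output for the KV cache gauge.

    # Filter the NIM metrics endpoint for the KV cache utilization gauge
    curl -s localhost:8080/metrics | grep gpu_cache_usage_perc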

    You will use the NIM for LLMs Grafana dashboard to observe these metrics. Download the dashboard JSON and upload it to your Grafana instance. To log in to the dashboard, see the Grafana access instructions.

    After loading the NIM for LLMs dashboard, you should see a similar dashboard as Figure 1. (I had both 70b and 8b deployed for that dashboard, hence the double KV cache numbers.)

    This diagram shows a NIM LLM Grafana dashboard that displays multiple time series metric charts, for example, KV Cache Percent Utilization, Time to First Token, and Inter Token Latency.
    Figure 1. NIM for LLMs Grafana dashboard

    Now that you have your observability stack and microservice deployed, you can start generating traffic and observing metrics in the Grafana dashboard. The tool used for generating traffic is genai-perf.

    To run this tool from a pod on your cluster, follow these steps, making sure to install it in the same namespace as your NIM for LLMs microservice.

    Create a pod with NVIDIA Triton:

    kubectl run <pod-name> --image=nvcr.io/nvidia/tritonserver:24.10-py3-sdk -n <namespace> --command -- sleep 100000

    Log into the pod and now you can run the genai-perf CLI:

    kubectl exec -n <namespace> --stdin --tty <pod-name> -- /bin/bash
    genai-perf --help

    Sending traffic to the meta/llama-3.1-8b-instruct model requires genai-perf to download the appropriate tokenizer from Hugging Face. Get an API token from Hugging Face and log in:

    pip install --upgrade huggingface_hub[cli]
    export HF_TOKEN=<hf-token>
    huggingface-cli login --token $HF_TOKEN

    Set up the correct environment variables and generate traffic. For more information about the different parameters, see the genai-perf documentation. The model name and service name must be accurate and reflect your setup.

    export INPUT_SEQUENCE_LENGTH=200
    export INPUT_SEQUENCE_STD=10
    export OUTPUT_SEQUENCE_LENGTH=200
    export CONCURRENCY=10
    export MODEL=meta/llama-3.1-8b-instruct
    genai-perf profile \
      -m $MODEL \
      --endpoint-type chat \
      --service-kind openai \
      --streaming \
      -u meta-llama3-8b:8080 \
      --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
      --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
      --concurrency $CONCURRENCY \
      --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
      --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
      --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
      --extra-inputs ignore_eos:true \
      --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
      -- \
      -v \
      --max-threads=256

    For this post, I created multiple traffic generation runs by varying the concurrency number: 100, 200, 300, and 400. From the Grafana dashboard, you can see the KV cache utilization percentage (Figure 2). The KV cache percent utilization is increasing with each concurrency trial, from 9.40% at 100 concurrency all the way to 40.9% at 400 concurrency. You can also change other relevant parameters, such as input and output sequence length, and observe the impact on KV cache utilization. 

    This diagram shows four NIM for LLMs Grafana dashboards for concurrency 100, 200, 300 and 400. The KV Cache Percent Utilization metric increases accordingly from 9.40% to 40.09%.
    Figure 2. Increased KV cache percent utilization with increase in concurrency
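
    If you want to reproduce the sweep, it can be scripted from inside the same pod. This is just the earlier genai-perf command (without the passthrough arguments) wrapped in a loop over the concurrency values used for Figure 2, assuming the same service URL and tokenizer:

    for CONCURRENCY in 100 200 300 400; do
      genai-perf profile -m $MODEL --endpoint-type chat --service-kind openai \
        --streaming -u meta-llama3-8b:8080 \
        --synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
        --synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
        --concurrency $CONCURRENCY \
        --output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
        --extra-inputs ignore_eos:true \
        --tokenizer meta-llama/Meta-Llama-3-8B-Instruct
    done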

    Now that you’ve observed the impact of concurrency on KV cache utilization, you can create the HPA resource. Create the HPA resource to scale based on the gpu_cache_usage_perc metric:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: gpu-hpa-cache
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: meta-llama3-8b
      minReplicas: 1
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: gpu_cache_usage_perc
          target:
            type: AverageValue
            averageValue: 100m
     
    kubectl create -f hpa-gpu-cache.yaml -n <namespace>
    kubectl get hpa -n <namespace> -w
    NAME            REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    gpu-hpa-cache   Deployment/meta-llama3-8b   9m/100m   1         10        1          3m37s

    Run genai-perf at different concurrencies (10, 100, 200) and watch the HPA metric increase:

    kubectl get hpa -n <namespace> -w
    NAME            REFERENCE                   TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    gpu-hpa-cache   Deployment/meta-llama3-8b   9m/100m   1         10        1          3m37s
    gpu-hpa-cache   Deployment/meta-llama3-8b   8m/100m   1         10        1          4m16s
    gpu-hpa-cache   Deployment/meta-llama3-8b   1m/100m   1         10        1          4m46s
    gpu-hpa-cache   Deployment/meta-llama3-8b   33m/100m   1         10        1          5m16s
    gpu-hpa-cache   Deployment/meta-llama3-8b   56m/100m   1         10        1          5m46s
    gpu-hpa-cache   Deployment/meta-llama3-8b   39m/100m   1         10        1          6m16s
    gpu-hpa-cache   Deployment/meta-llama3-8b   208m/100m   1         10        1          6m46s
    gpu-hpa-cache   Deployment/meta-llama3-8b   208m/100m   1         10        3          7m1s
    gpu-hpa-cache   Deployment/meta-llama3-8b   293m/100m   1         10        3          7m16s
    gpu-hpa-cache   Deployment/meta-llama3-8b   7m/100m     1         10        3          7m46s

    Check the number of pods and you should see that autoscaling added two new pods:

    kubectl get pods -n <namespace>
    NAME                              READY   STATUS    RESTARTS   AGE
    meta-llama3-8b-5c6ddbbfb5-85p6c   1/1     Running   0          25s
    meta-llama3-8b-5c6ddbbfb5-dp2mv   1/1     Running   0          146m
    meta-llama3-8b-5c6ddbbfb5-sf85v   1/1     Running   0          26s

    HPA also scales down. The wait period before scaling down is dictated by the --horizontal-pod-autoscaler-downscale-stabilization flag, which defaults to 5 minutes. This means that scale-downs occur gradually, smoothing out the impact of rapidly fluctuating metric values. Wait 5 minutes and check the scale-down:

    kubectl get pods -n <namespace>
    NAME                              READY   STATUS    RESTARTS   AGE
    meta-llama3-8b-5c6ddbbfb5-dp2mv   1/1     Running   0          154m
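
    The stabilization window can also be tuned per HPA through the behavior field instead of the cluster-wide flag. As a minimal sketch, the following patch shortens the scale-down window of the HPA created earlier to 120 seconds; the value is only an example.

    # Shorten the scale-down stabilization window for this HPA only
    kubectl patch hpa gpu-hpa-cache -n <namespace> --type merge \
      -p '{"spec":{"behavior":{"scaleDown":{"stabilizationWindowSeconds":120}}}}'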

    Conclusion

    In this post, I described how to set up your Kubernetes cluster to scale on custom metrics and showed how to scale a NIM for LLMs microservice based on its KV cache utilization.

    There are many advanced areas to explore further in this topic. For example, many other metrics could also be considered for scaling, such as request latency, request throughput, and GPU compute utilization. You can scale on multiple metrics in one HPA resource and scale accordingly. 
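
    As a hedged illustration, scaling on several signals only requires additional entries in the HPA metrics list; the num_requests_running metric name below is an assumption about what your Prometheus adapter exposes, so substitute a metric you have actually configured. HPA computes a desired replica count for each metric and scales to the largest of them.

      # Replaces the metrics section of the earlier gpu-hpa-cache HPA (sketch)
      metrics:
      - type: Pods
        pods:
          metric:
            name: gpu_cache_usage_perc
          target:
            type: AverageValue
            averageValue: 100m
      - type: Pods
        pods:
          metric:
            name: num_requests_running   # assumed metric name
          target:
            type: AverageValue
            averageValue: "5"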

    Another area of interest is creating new metrics with the Prometheus Query Language (PromQL) and adding them to the Prometheus adapter ConfigMap so that HPA can scale on them.
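
    As a sketch of that idea, a Prometheus adapter rule can apply PromQL to derive a new per-pod metric. The request_success_total counter below is purely hypothetical, so replace it with a counter your NIM deployment actually exports; the rule then exposes a request_success_per_second metric that HPA can target.

    # Added under rules.custom in the Prometheus adapter configuration (sketch)
    - seriesQuery: 'request_success_total{namespace!="",pod!=""}'   # hypothetical counter
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'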
