
    Boost Llama Model Performance on Microsoft Azure AI Foundry with NVIDIA TensorRT-LLM

    Microsoft, in collaboration with NVIDIA, announced transformative performance improvements for the Meta Llama family of models on its Azure AI Foundry platform. These advancements, enabled by NVIDIA TensorRT-LLM optimizations, deliver significant gains in throughput, reduced latency, and improved cost efficiency, all while preserving the quality of model outputs.

    With these improvements, Azure AI Foundry customers can achieve significant throughput gains: a 45% increase for the Llama 3.3 70B and Llama 3.1 70B models and a 34% increase for the Llama 3.1 8B model in the serverless deployment (Model-as-a-Service) offering in the model catalog. 

    Faster token generation speeds and reduced latency make real-time applications like chatbots, virtual assistants, and automated customer support more responsive and efficient. This translates into better price-performance ratios, significantly reducing the cost per token for LLM-powered applications.

    The model catalog in Azure AI Foundry simplifies access to these optimized Llama models by eliminating the complexities of infrastructure management. Developers can deploy and scale models effortlessly using serverless APIs with pay-as-you-go pricing, quickly enabling large-scale use cases without upfront infrastructure costs. 
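
    For example, a serverless Llama deployment can be called with a few lines of Python. The following minimal sketch uses the azure-ai-inference package; the endpoint URL and API key are placeholders that come from your own Foundry deployment.

        # Minimal sketch: calling a serverless (MaaS) Llama endpoint on Azure AI Foundry.
        # The endpoint URL and API key are placeholders from your own deployment.
        from azure.ai.inference import ChatCompletionsClient
        from azure.ai.inference.models import SystemMessage, UserMessage
        from azure.core.credentials import AzureKeyCredential

        client = ChatCompletionsClient(
            endpoint="https://<your-endpoint>",                # placeholder
            credential=AzureKeyCredential("<your-api-key>"),   # placeholder
        )

        response = client.complete(
            messages=[
                SystemMessage(content="You are a helpful assistant."),
                UserMessage(content="Summarize TensorRT-LLM in one sentence."),
            ],
        )
        print(response.choices[0].message.content)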

    Azure’s enterprise-grade security ensures that customer data remains private and protected during API usage.

    By combining NVIDIA accelerated computing with Azure AI Foundry’s seamless deployment capabilities, developers and businesses can scale effortlessly, reduce deployment costs, and lower total cost of ownership (TCO), while maintaining the highest standards of quality and reliability.

    NVIDIA TensorRT-LLM optimizations drive performance gains   

    Microsoft and NVIDIA engaged in a deep technical collaboration to optimize the performance of the Llama models. Central to this collaboration is the integration of NVIDIA TensorRT-LLM as the backend for serving these models within Azure AI Foundry. 

    Initial efforts focused on the Llama 3.1 70B Instruct, Llama 3.3 70B Instruct, and Llama 3.1 8B models, where comprehensive profiling and joint engineering uncovered several opportunities for optimization. These efforts yielded a 45% increase in throughput for the 70B models and a 34% increase in throughput for the 8B model, using new optimizations from TensorRT-LLM while preserving model fidelity.

    Key enhancements include the GEMM Swish-Gated Linear Unit (SwiGLU) activation plugin (--gemm_swiglu_plugin fp8), which fuses two General Matrix Multiplications (GEMMs) without biases and the SwiGLU activation into a single kernel, significantly improving computational efficiency for FP8 data on NVIDIA Hopper GPUs. 
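
    To make the fusion concrete, the following NumPy sketch spells out the computation unfused, purely for illustration: two bias-free GEMMs followed by the Swish-gated product that the plugin executes as one kernel.

        # Unfused NumPy sketch of what the GEMM SwiGLU plugin computes in a
        # single FP8 kernel: two bias-free GEMMs plus the SwiGLU activation.
        import numpy as np

        def silu(x):
            # SiLU (Swish) activation: x * sigmoid(x)
            return x / (1.0 + np.exp(-x))

        def gemm_swiglu(x, w_gate, w_up):
            gate = x @ w_gate        # GEMM 1, no bias
            up = x @ w_up            # GEMM 2, no bias
            return silu(gate) * up   # Swish-gated linear unit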

    The Reduce Fusion (--reduce_fusion enable) optimization combines ResidualAdd and LayerNorm operations following AllReduce into a single kernel, improving latency and overall performance, particularly for small batch sizes and token-intensive workloads where latency is critical. 
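
    The unfused sequence looks like the following NumPy sketch; reduce fusion runs the ResidualAdd and LayerNorm steps as one kernel after the AllReduce, saving kernel-launch overhead and memory traffic.

        # Unfused NumPy sketch of the ops that reduce fusion combines: after the
        # tensor-parallel AllReduce, ResidualAdd and LayerNorm become one kernel.
        import numpy as np

        def residual_add_layernorm(allreduce_out, residual, gamma, beta, eps=1e-5):
            h = allreduce_out + residual             # ResidualAdd
            mu = h.mean(axis=-1, keepdims=True)      # LayerNorm statistics
            var = h.var(axis=-1, keepdims=True)
            return gamma * (h - mu) / np.sqrt(var + eps) + beta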

    Another major improvement is the User Buffer (--user_buffer) feature introduced in TensorRT-LLM v0.16, eliminating unnecessary memory copies from local to shared buffers in the communication kernel. This optimization greatly enhances inter-GPU communication performance, especially for FP8 precision in large-scale Llama models.
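
    For self-managed deployments, these three optimizations correspond to trtllm-build flags. The sketch below shows one hypothetical invocation; the checkpoint and output paths are placeholders, and flag availability depends on your TensorRT-LLM version (--user_buffer requires v0.16 or later).

        # Hypothetical engine build enabling the three optimizations described above.
        # Paths are placeholders; flag names follow the trtllm-build CLI.
        import subprocess

        subprocess.run(
            [
                "trtllm-build",
                "--checkpoint_dir", "./llama-3.3-70b-instruct-fp8",  # placeholder FP8 checkpoint
                "--gemm_swiglu_plugin", "fp8",   # fused GEMM + SwiGLU kernel
                "--reduce_fusion", "enable",     # fuse ResidualAdd + LayerNorm after AllReduce
                "--user_buffer", "enable",       # skip local-to-shared buffer copies (v0.16+)
                "--output_dir", "./engines/llama-3.3-70b",
            ],
            check=True,
        )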

    The resulting increase in throughput translates directly to faster token generation and reduced latency, improving overall responsiveness while lowering cost per token for customers. Additionally, resource utilization is significantly optimized, with fusion techniques reducing kernel overhead and improving memory efficiency. 

    Despite these substantial performance gains, response quality and accuracy remain intact, ensuring that optimizations don’t degrade the model’s output integrity.

    The innovations behind these gains, powered by NVIDIA TensorRT-LLM, are available to the entire developer community. Developers can leverage the same optimizations to achieve faster, more cost-effective AI inference, enabling more responsive and scalable AI-driven products that can be deployed on NVIDIA accelerated computing platforms anywhere.

    Access the performance of NVIDIA-optimized Llama models on Azure AI Foundry

    This collaboration between Microsoft and NVIDIA exemplifies co-engineering excellence by combining Microsoft's expertise in cloud infrastructure with NVIDIA's leadership in AI and performance optimization. Experience these performance improvements firsthand by trying out the Llama model APIs on Azure AI Foundry. 

    For developers who prefer to customize and deploy their own models while managing infrastructure, Azure offers flexible options to leverage NVIDIA accelerated computing. You can deploy your models on Azure VMs or Azure Kubernetes Service (AKS) with NVIDIA TensorRT-LLM to achieve similar performance gains while maintaining control over your infrastructure and deployment pipeline. 
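
    As a starting point, the following sketch uses TensorRT-LLM's high-level Python LLM API on a GPU-equipped Azure VM or AKS node; the model ID is a placeholder, and the weights must be accessible to the process.

        # Minimal sketch: self-managed inference with TensorRT-LLM's Python LLM API.
        # The model ID is a placeholder; run on a machine with NVIDIA GPUs.
        from tensorrt_llm import LLM, SamplingParams

        llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")      # placeholder model
        params = SamplingParams(max_tokens=64, temperature=0.7)

        for output in llm.generate(["Why does kernel fusion cut latency?"], params):
            print(output.outputs[0].text)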

    In addition, NVIDIA AI Enterprise, available on the Azure Marketplace, includes TensorRT-LLM as part of its comprehensive suite of AI tools and frameworks, providing enterprise-grade support and optimizations for production deployments.

    At NVIDIA GTC 2025, Microsoft and NVIDIA also announced the integration of NVIDIA NIM with Azure AI Foundry. While TensorRT-LLM enables model builders to customize, fine-tune, and optimize the performance of their models on Azure, NVIDIA NIM, a set of easy-to-use inference microservices, offers pre-optimized AI models with enterprise-grade support for AI application developers. 
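
    Because NIM microservices expose an OpenAI-compatible API, a deployed endpoint can be called with the standard openai client, as in the minimal sketch below; the base URL, API key, and model name are placeholders for your own NIM deployment on Azure.

        # Minimal sketch: calling a NIM endpoint through its OpenAI-compatible API.
        # Base URL, API key, and model name are placeholders for your deployment.
        from openai import OpenAI

        client = OpenAI(
            base_url="https://<your-nim-endpoint>/v1",  # placeholder
            api_key="<your-api-key>",                   # placeholder
        )

        completion = client.chat.completions.create(
            model="meta/llama-3.1-8b-instruct",  # placeholder model name
            messages=[{"role": "user", "content": "Hello!"}],
        )
        print(completion.choices[0].message.content)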

    Whether you choose Azure AI Foundry’s fully managed MaaS offering or deploy models on your own in Azure AI Foundry, the full-stack NVIDIA accelerated computing platform enables you to build more efficient and responsive AI-powered applications. 
