Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation.
To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder, a family of advanced LLMs for code generation, reasoning, and fixing across popular programming languages. This post explores the benefits of the inference optimizations for Qwen2.5-Coder models supported in NVIDIA TensorRT-LLM, and the ease of deploying them with NVIDIA NIM for improved coding efficiency.
Qwen2.5-Coder models
The Qwen2.5-Coder models have achieved state-of-the-art performance across popular academic benchmarks. NVIDIA TensorRT-LLM has optimized three popular models from the Qwen2.5-Coder family (the 1.5B, 7B, and 32B versions) for high throughput and low latency. TensorRT-LLM is a library for fast, efficient LLM inference that includes optimizations such as in-flight batching, KV caching, KV cache reuse, and several speculative decoding techniques.
These optimizations help deliver performance improvements for the Qwen2.5-Coder models on popular programming languages such as Python, C++, Java, Bash, JavaScript, TypeScript, and Go, reaching a wider range of developers. This post explores the lookahead decoding optimization and the performance boost it delivers. Without any additional training or draft models, developers can use the TensorRT-LLM high-level API to speed up Qwen2.5-Coder inference for multiline code completion.
Lookahead decoding
Lookahead decoding is a speculative decoding technique that addresses the slow autoregressive nature of LLMs. Each autoregressive decoding step only generates one token at a time, not leveraging the massive parallel processing power of NVIDIA GPUs, leading to low GPU utilization and lower throughput. We’ve previously discussed the throughput boost achievable with draft target speculative decoding, and here we discuss the benefits of leveraging TensorRT-LLM lookahead decoding implementation using the Qwen2.5-Coder models as an example.
Unlike the single-token generation of autoregressive decoding, lookahead decoding generates multiple tokens simultaneously, making better use of the GPU's parallel processing capabilities and trading available computation (FLOPs) for lower latency. Moreover, lookahead decoding doesn't require the separate draft model needed for draft target speculative decoding.
Each decoding step is divided into two parallel branches: the lookahead branch and the verification branch. Using the Jacobi iteration method, a classic nonlinear systems solver, the lookahead branch performs parallel decoding for future tokens by generating n-grams. The verification branch selects and verifies the promising n-gram candidates generated by the lookahead branch.
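To make the verification step concrete, below is a minimal toy sketch. It is not the TensorRT-LLM implementation: `greedy_next_token` is a hypothetical stand-in for one forward pass of the base model, and in practice all candidates are verified in a single batched forward pass on the GPU rather than a Python loop.

```python
def verify_candidates(context, candidates, greedy_next_token):
    """Toy verification branch: accept the longest candidate n-gram whose tokens
    all match what greedy decoding would have produced. Real implementations
    verify every candidate in one batched forward pass instead of looping."""
    best = []
    for ngram in candidates:            # candidate n-grams from the lookahead branch
        accepted = []
        ctx = list(context)
        for token in ngram:
            if token != greedy_next_token(ctx):
                break                   # first mismatch ends this candidate
            accepted.append(token)
            ctx.append(token)
        if len(accepted) > len(best):
            best = accepted
    return best                         # accepted tokens match greedy decoding exactly
```

Because every accepted token equals the greedy prediction, the generated text is identical to standard autoregressive decoding; the speedup comes from accepting several tokens per model pass instead of one.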
The lookahead algorithm is configured using three key parameters: window size (W), n-gram size (N), and verification set size (G). Together, they bound how many draft tokens can be proposed per step, as shown in the sketch after this list.
- Window size (W): Represents the lookahead window size, which determines how many future tokens the algorithm attempts to predict in each step. A larger window enables the model to look further ahead, helping it generate more tokens in a single pass, which improves throughput while using GPU FLOPs efficiently.
- N-gram size (N): Represents the size of the n-grams used in the lookahead process. For example, a 5-gram is a contiguous sequence of 5 future tokens. Together with the window size, it defines a fixed-size, 2D window from which the lookahead branch generates n-grams along the Jacobi iteration trajectory.
- Verification set size (G): Represents the maximum number of speculations or candidate n-grams that the algorithm considers in each step for verification. It balances the trade-off between computation efficiency and exploring more possibilities.
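These three values together bound the number of draft tokens that can be proposed per decoding step. Here is a minimal sketch of that relationship, using the formula from the build configuration shown later in this post (the helper name is ours):

```python
def max_draft_len(w: int, n: int, g: int) -> int:
    """Maximum number of draft tokens per step for lookahead decoding,
    as used for BuildConfig(max_draft_len=...) later in this post."""
    return (w + g - 1) * (n - 1) + (0 if n <= 1 else n - 2)


# (W, N, G) = (8, 8, 8), as in the 7B benchmark below
print(max_draft_len(8, 8, 8))      # 111
# (W, N, G) = (15, 15, 15), as in the 32B benchmark below
print(max_draft_len(15, 15, 15))   # 419
```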

Lookahead performance depends heavily on the base model, hardware, batch size, sequence length, and dataset. It is recommended to profile various configurations to find the best (W, N, G) setting for a given setup. An optimal (W, N, G) configuration enables lookahead decoding to deliver improved throughput without any additional training, fine-tuning, or draft models.
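One way to run such a profile is a simple sweep over candidate (W, N, G) tuples using the same high-level API shown later in this post. The sketch below is illustrative only: the candidate list, prompt, and timing loop are placeholders to adapt to your own workload, and each iteration rebuilds the engine, so a full sweep can take a while.

```python
import time

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import BuildConfig, LookaheadDecodingConfig


def max_draft_len(w: int, n: int, g: int) -> int:
    # Same formula used for BuildConfig(max_draft_len=...) in the full example below.
    return (w + g - 1) * (n - 1) + (0 if n <= 1 else n - 2)


candidates = [(4, 4, 4), (8, 8, 8), (15, 15, 15)]   # placeholder sweep values
prompt = "Write a Python function that parses a CSV file into a list of dicts."

for w, n, g in candidates:
    build_config = BuildConfig(max_draft_len=max_draft_len(w, n, g))
    lookahead_config = LookaheadDecodingConfig(max_window_size=w,
                                               max_ngram_size=n,
                                               max_verification_set_size=g)
    llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct",
              build_config=build_config,
              speculative_config=lookahead_config)
    sampling_params = SamplingParams(lookahead_config=lookahead_config)

    start = time.perf_counter()
    output = llm.generate(prompt, sampling_params=sampling_params)
    elapsed = time.perf_counter() - start
    # `output` holds the generated text; divide its token count by `elapsed`
    # to get tokens/second for your setup.
    print((w, n, g), f"{elapsed:.2f}s")
```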
By sweeping over (W, N, G) configuration values in our experiments, we achieved 3.6x and 1.6x throughput speedups for the Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models, respectively. These speedups are measured in throughput (tokens/second) against a baseline without lookahead speculative decoding on NVIDIA H100 Tensor Core GPUs, as shown in Figure 2.

Figure 2. Data measured on 01/30/2025. Inference throughput (output tokens/second) speedups of Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models. DGX H100; Qwen2.5-Coder 7B Instruct: TP=1, (W, N, G) = (8, 8, 8); Qwen2.5-Coder 32B Instruct: TP=2, (W, N, G) = (15, 15, 15); batch size = 1; TensorRT-LLM version 0.15.0.
Similar throughput speedups are achieved on NVIDIA H200 Tensor Core GPUs. Their higher memory bandwidth also raises the baseline throughput, which results in slightly lower relative speedups than on H100 GPUs (Figure 3).

Figure 3. Data measured on 01/30/2025. Inference throughput (output tokens/second) speedups of Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models. DGX H200; Qwen2.5-Coder 7B Instruct: TP=1, (W, N, G) = (8, 8, 8); Qwen2.5-Coder 32B Instruct: TP=2, (W, N, G) = (15, 15, 15); batch size = 1; TensorRT-LLM version 0.15.0.
Steps to run lookahead decoding with TensorRT-LLM
To reproduce these performance gains using lookahead speculative decoding within TensorRT-LLM, follow the steps below.
```bash
# Install TensorRT-LLM. (Commands below are for Linux. Refer to the TensorRT-LLM docs for Windows.)
sudo apt-get -y install libopenmpi-dev && \
    pip3 install --upgrade setuptools && \
    pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com
```
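As a quick sanity check (assuming the package exposes its version string, as recent releases do), the installation can be verified from the same Python environment:

```python
# Verify the installation from the environment used for the pip install above.
import tensorrt_llm

print(tensorrt_llm.__version__)   # the benchmarks in this post used version 0.15.0
```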
Then run lookahead decoding in TensorRT-LLM using the high-level API.
```python
# Command for Qwen2.5-Coder-7B-Instruct
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import (BuildConfig, KvCacheConfig,
                                 LookaheadDecodingConfig)


def main():
    """The end user can customize the build configuration with the build_config class.

    Max draft length is based on (W, N, G) values and calculated as:
    (W + G - 1) * (N - 1) + (N <= 1 ? 0 : N - 2)
    """
    build_config = BuildConfig(max_batch_size=128,
                               max_input_len=2048,
                               max_seq_len=4096,
                               max_num_tokens=16384,
                               max_draft_len=111)
    build_config.plugin_config.reduce_fusion = True
    build_config.plugin_config.use_paged_context_fmha = True
    build_config.plugin_config.multiple_profiles = True

    # The configuration for lookahead decoding: (W, N, G) = (8, 8, 8)
    lookahead_config = LookaheadDecodingConfig(max_window_size=8,
                                               max_ngram_size=8,
                                               max_verification_set_size=8)

    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)

    llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct",
              kv_cache_config=kv_cache_config,
              build_config=build_config,
              speculative_config=lookahead_config)

    prompt = """Write a C++ program to find the nth Fibonacci number using recursion.
Now we define a sequence of numbers in which each number is the sum of the three
preceding ones. The first three numbers are 0, -1, -1. Write a program to find
the nth number."""

    sampling_params = SamplingParams(lookahead_config=lookahead_config)

    output = llm.generate(prompt, sampling_params=sampling_params)
    print(output)


if __name__ == '__main__':
    main()
```
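If the script is saved as, say, lookahead_qwen.py (a hypothetical filename), it can be run with python3 lookahead_qwen.py. Note that the lookahead configuration is supplied both when the engine is built (through speculative_config) and per request (through SamplingParams), and that max_draft_len=111 corresponds to the (W, N, G) = (8, 8, 8) setting used for the 7B benchmark in Figure 2.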
Summary
Lookahead speculative decoding boosts LLM throughput without any additional training, fine-tuning, or draft models. We presented benchmarked performance improvements on the Qwen2.5-Coder models. Visit build.nvidia.com to try the Qwen2.5-Coder models optimized with NVIDIA TensorRT-LLM for free. Qwen2.5-Coder models optimized with TensorRT-LLM are also packaged as downloadable NVIDIA NIM microservices for ease of deployment.
Acknowledgments
We would like to thank Liwei Ma, Fanrong Li, Nikita Korobov, and Martin Marciniszyn Mehringer for their efforts in supporting this post.