Boost Llama 3.3 Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding

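With the recent addition of Llama 3.3 70B, a text-only instruction-tuned model, Meta's collection of open large language models (LLMs) continues to grow. Llama 3.3 delivers improved performance over the older Llama 3.1 70B model and can even match the capabilities of the larger, more computationally expensive Llama 3.1 405B model on several tasks, including math, reasoning, coding, and multilingual support.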

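NVIDIA TensorRT-LLM is a powerful inference engine that delivers state-of-the-art performance on the latest LLMs, and it incorporates many optimizations to deliver outstanding Llama 3.3 70B inference throughput. These include in-flight batching, KV caching, custom FP8 quantization, and speculative decoding, among others, for fast and cost-efficient LLM serving.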

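With in-flight batching activated by default as a runtime configuration parameter, TensorRT-LLM supports batching multiple different requests at the same time, which increases serving throughput. By interleaving requests in the context and generation phases, in-flight batching executes new requests while older requests are still in flight, which reduces latency and improves GPU utilization. Finished requests are evicted from the batch, making room for the next set of requests.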
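The scheduling idea can be illustrated with a toy simulation. This is a minimal sketch and not TensorRT-LLM's scheduler: it ignores the context phase, assumes each active request emits exactly one token per iteration, and the function names and request lengths are hypothetical. It only compares how many decoding iterations are needed when finished requests free their slots immediately versus when a batch runs until its longest request ends.

```python
from collections import deque


def simulate_inflight_batching(output_lengths, max_batch_size):
    """Count decoding iterations when finished requests are evicted and
    queued requests are admitted at every iteration."""
    waiting = deque(output_lengths)  # tokens still to generate, per queued request
    active = []                      # tokens still to generate, per running request
    iterations = 0
    while waiting or active:
        # Admit queued requests into any free batch slots.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One generation step: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        # Evict requests that just finished so their slots free up at once.
        active = [remaining for remaining in active if remaining > 0]
        iterations += 1
    return iterations


def simulate_static_batching(output_lengths, max_batch_size):
    """Baseline: each batch occupies the GPU until its longest request ends."""
    iterations = 0
    for start in range(0, len(output_lengths), max_batch_size):
        iterations += max(output_lengths[start:start + max_batch_size])
    return iterations


# Hypothetical output lengths (in tokens) for six queued requests.
lengths = [2, 8, 3, 8, 1, 2]
inflight_iterations = simulate_inflight_batching(lengths, max_batch_size=2)
static_iterations = simulate_static_batching(lengths, max_batch_size=2)
print(inflight_iterations, static_iterations)
```

In this toy setting the in-flight schedule needs fewer iterations because short requests no longer wait for the longest request in their batch.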

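Caching the key-value (KV) elements of previous tokens avoids expensive recomputation of these tensors in the generation phase for the next set of tokens. The saved computation effectively increases throughput. However, the KV cache grows linearly in size with the number of batched requests and with the sequence context length, which leads to higher memory requirements.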
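A back-of-the-envelope estimate makes the linear growth concrete. The sketch below assumes a Llama-3-70B-style shape (80 layers, 8 KV heads, head dimension 128) and 1 byte per element for an FP8 KV cache; these values are assumptions for illustration, not figures from this post, and the helper name is hypothetical.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_element):
    """Estimate KV cache size. The factor 2 accounts for storing one key
    tensor and one value tensor per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * batch_size * seq_len


# Assumed model shape and FP8 element size (illustrative only).
LAYERS, KV_HEADS, HEAD_DIM, FP8_BYTES = 80, 8, 128, 1

for batch_size, seq_len in [(1, 8192), (32, 8192), (32, 131072)]:
    size_gib = kv_cache_bytes(batch_size, seq_len, LAYERS, KV_HEADS,
                              HEAD_DIM, FP8_BYTES) / 2**30
    print(f"batch={batch_size:>2} seq_len={seq_len:>6} -> {size_gib:.2f} GiB")
```

Doubling either the batch size or the sequence length doubles the estimate, which is why paging, quantization, and reuse of the cache matter at large batch sizes and long contexts.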

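TensorRT-LLM KV caching addresses these challenges through several optimizations, including support for paged KV cache, quantized KV cache, circular buffer KV cache, and KV cache reuse. Each of these optimizations addresses the challenging balance between growing memory size and avoiding unnecessary and expensive recomputation.

Speculative decoding is a popular technique for faster and more cost-effective LLM inference with built-in verification of the quality of the generated output. It is based on the premise that generating multiple sequences of future (draft) tokens is more efficient than processing a single token in autoregressive decoding. The target model determines how many of these draft tokens to accept, which is far more efficient than generating one token per iteration. TensorRT-LLM supports a growing list of speculative decoding techniques, including draft target, Medusa, Eagle, and lookahead decoding, among others.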
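The draft-target loop can be sketched in a few lines. This is a simplified illustration with greedy, token-based acceptance, not TensorRT-LLM's implementation: the two "models" are hypothetical toy functions over integer tokens, and the verification loop only mimics the result of what a real engine computes in a single target forward pass.

```python
def speculative_step(prefix, draft_next, target_next, draft_len):
    """Run one draft-target iteration and return the tokens it appends."""
    # 1. The draft model proposes draft_len tokens autoregressively.
    draft_tokens = []
    for _ in range(draft_len):
        draft_tokens.append(draft_next(prefix + draft_tokens))
    # 2. The target model checks the proposals position by position.
    accepted = []
    for token in draft_tokens:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # 3. All drafts matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix, draft_next, target_next, draft_len, max_new_tokens):
    """Generate tokens and count how many target iterations were needed."""
    output = list(prefix)
    target_passes = 0
    while len(output) - len(prefix) < max_new_tokens:
        output += speculative_step(output, draft_next, target_next, draft_len)
        target_passes += 1
    return output[:len(prefix) + max_new_tokens], target_passes


def target_next(tokens):
    # Toy target model: counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    # Toy draft model: agrees with the target except after token 4.
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


tokens, target_passes = generate([0], draft_next, target_next,
                                 draft_len=3, max_new_tokens=12)
print(tokens, target_passes)
```

The output is identical to what the toy target model would produce on its own, but it is reached in fewer target iterations than the 12 that one-token-at-a-time decoding would need. The closer the draft model tracks the target, the more tokens each iteration accepts.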

*Figure 1. NVIDIA TensorRT optimizations for high-performance deep learning inference.* The image shows the optimization techniques supported out of the box by the TensorRT ecosystem of libraries, including TensorRT-LLM and TensorRT Model Optimizer, which provide better throughput and lower latency so that fewer resources can serve the same workloads.

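In this post, we show how the NVIDIA HGX H200 platform with NVLink and NVSwitch, together with TensorRT-LLM, achieves great performance when running the latest Llama 3.3 70B model. We describe the step-by-step setup for speculative decoding with Llama 3.3 70B in TensorRT-LLM. For more information about other optimizations, different models, and multi-GPU execution, see the full list of TensorRT-LLM examples.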

Achieving throughput speedups with draft target speculative decoding

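Table 1 and Figure 2 highlight the throughput (output tokens/second) speedups of draft models of various sizes over no draft model (that is, no speculative decoding), all with the Llama 3.3 70B target model.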

Throughput performance: output tokens/second on one NVIDIA H200 Tensor Core GPU
Draft | Target models	Llama 3.2 1B | Llama 3.3 70B	Llama 3.2 3B | Llama 3.3 70B	Llama 3.1 8B | Llama 3.3 70B	Llama 3.3 70B (no draft model)
Tokens/sec	181.74	161.53	134.38	51.14
Speedup (with versus without draft model)	3.55x	3.16x	2.63x	N/A

Table 1. Throughput performance using one NVIDIA H200 Tensor Core GPU with TensorRT-LLM, measured internally

Data measured on December 11, 2024. Output tokens/second is inclusive of the time to generate the first token: tok/s = total generated tokens / total latency. DGX H200, TP1, FP8, batch size = 1, TensorRT Model Optimizer version 0.21, TensorRT-LLM version 0.15.0.
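The metric and the speedup column reduce to two divisions. The sketch below uses hypothetical helper names; the 134.38 and 51.14 tokens/second values are the Llama 3.1 8B draft and no-draft entries from Table 1, while the 1,024 tokens over 20 seconds is a made-up example.

```python
def output_tokens_per_second(total_generated_tokens, total_latency_seconds):
    """tok/s = total generated tokens / total latency, where the latency
    includes the time to generate the first token."""
    return total_generated_tokens / total_latency_seconds


def speedup(with_draft_tok_s, baseline_tok_s):
    """Throughput with a draft model relative to the no-draft baseline."""
    return with_draft_tok_s / baseline_tok_s


# Hypothetical run: 1,024 output tokens in 20 seconds end to end.
print(output_tokens_per_second(1024, 20.0))

# Table 1: Llama 3.1 8B draft model versus no draft model.
print(round(speedup(134.38, 51.14), 2))
```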

*Figure 2. Throughput speedups using speculative decoding with the Llama 3.3 70B target model.* The bar chart compares throughput when using the Llama 3.3 70B target model with draft models of different sizes against using no draft model (that is, no speculative decoding).

The following steps reproduce these performance gains using draft target speculative decoding in TensorRT-LLM.

# Download the following model checkpoints from Hugging Face and store them
# in a directory for easy access through the setup process.
 
git lfs install
 
# Download target models
git clone https://huggingface.co/meta-llama/Meta-Llama-3.3-70B-Instruct
 
# Download draft models
git clone https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct
git clone https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct

After the model checkpoints have been downloaded, install TensorRT-LLM.

# Obtain and start the basic docker image environment (optional).
docker run --rm --ipc=host --runtime=nvidia --gpus all --entrypoint \
    /bin/bash -it nvidia/cuda:12.5.1-devel-ubuntu22.04
 
# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin \
    libopenmpi-dev git git-lfs
 
# Fetch the library
git clone -b v0.15.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
 
# Install the latest version (corresponding to the main branch) of TensorRT-LLM.
pip3 install tensorrt_llm -U --extra-index-url https://pypi.nvidia.com
 
# Check installation
python3 -c "import tensorrt_llm"

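Next, compile the downloaded model checkpoints into draft and target TensorRT engines. These engines are optimized to run inference with the best accuracy and highest throughput.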

cd examples
 
# Steps to build target and draft models in FP8 precision on 1 H200
 
# Create FP8 checkpoints
 
python3 quantization/quantize.py \
    --model_dir <path to draft model repo> \
    --output_dir ./ckpt-draft \
    --dtype float16 --qformat fp8 --kv_cache_dtype fp8 \
    --calib_size 512 --tp_size 1
 
python3 quantization/quantize.py \
    --model_dir=<path to target model repo> \
    --output_dir=./ckpt-target-70b \
    --dtype=float16 --qformat fp8 --kv_cache_dtype fp8 \
    --calib_size 512 --tp_size 1 
 
# Build draft and target engines
# Important flags for the engine build process:
# --use_paged_context_fmha=enable must be specified since we need KVcache reuse for the draft/target model.
 
# --speculative_decoding_mode=draft_tokens_external and --max_draft_len must be specified for target model.
 
trtllm-build \
    --checkpoint_dir ./ckpt-draft \
    --output_dir=./draft-engine \
    --gpt_attention_plugin float16 \
    --workers 1 \
    --gemm_plugin=fp8 \
    --use_paged_context_fmha=enable \
    --multiple_profiles enable \
    --max_batch_size=32 \
    --max_seq_len=131072
 
trtllm-build \
    --checkpoint_dir=./ckpt-target-70b \
    --output_dir=./target-engine \
    --gpt_attention_plugin float16 \
    --workers 1 \
    --gemm_plugin=fp8 \
    --use_paged_context_fmha=enable \
    --multiple_profiles enable \
    --max_batch_size=32 \
    --max_seq_len=131072 \
    --low_latency_gemm_plugin fp8 \
    --speculative_decoding_mode=draft_tokens_external \
    --max_draft_len 10

Finally, run speculative decoding in TensorRT-LLM.

# Run decoding

# Important flags to set during the run process:
# --draft_engine_dir and --engine_dir must be specified for the draft and target engines.

# --draft_target_model_config corresponds to the configuration of the
# Draft-Target-Model. As an example, [4,[0],[1],False] means draft_len=4,
# the draft model runs on GPU0, the target model runs on GPU1, and tokens
# rather than logits are used for acceptance.

# Only the CPP session (using the executor as the low-level API) is supported;
# the Python session (--use_py_session) is not supported.
 
# Run with Llama 3.3 70B target model
 
mpirun -n 1 --allow-run-as-root python3 ./run.py \
    --tokenizer_dir <path to draft model repo> \
    --draft_engine_dir ./draft-engine \
    --engine_dir ./target-engine \
    --draft_target_model_config="[10,[0,1,2,3,4,5,6,7],[0,1,2,3,4,5,6,7],False]" \
    --kv_cache_free_gpu_memory_fraction=0.35 \
    --max_output_len=1024 \
    --kv_cache_enable_block_reuse \
    --input_text="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nA 3-digit integer contains one of each of the digits 1,3 and 5. What is the probability that the integer is divisible by 5.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
 
# Following is the LLM-generated output:
 
Output [Text 0 Beam 0]: "## Step 1: Determine the total number of 3-digit 
integers that can be formed using the digits 1, 3, and 5.
There are 3! = 6 ways to arrange the digits 1, 3, and 5 to form different 3-digit integers.
 
## Step 2: Identify the condition for a number to be divisible by 5.
A number is divisible by 5 if its last digit is either 0 or 5.
 
## Step 3: Determine the number of arrangements where 5 is the last digit.
Since the digit 5 must be the last digit for the number to be divisible by 
5, we fix the last position with 5. The remaining two positions can be 
filled with the digits 1 and 3 in 2! = 2 ways.
 
## Step 4: Calculate the probability that the integer is divisible by 5.
The probability is the number of favorable outcomes (arrangements where 5 is the last digit) 
divided by the total number of possible outcomes (total arrangements of the digits 1, 3, and 5).
 
## Step 5: Calculate the probability.
Probability = (Number of favorable outcomes) / (Total number of outcomes) = 2 / 6 = 1/3.
 
The final answer is: $\boxed{\frac{1}{3}}$"
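The four fields of the --draft_target_model_config value described in the comments above can be unpacked as follows. This is only an illustration of the value's structure, using a hypothetical helper and hypothetical field names; it is not how run.py parses the flag.

```python
import ast


def parse_draft_target_model_config(text):
    """Unpack a value such as "[4,[0],[1],False]" into named fields."""
    draft_len, draft_devices, target_devices, use_logits = ast.literal_eval(text)
    return {
        "draft_len": draft_len,            # draft tokens proposed per iteration
        "draft_devices": draft_devices,    # GPU IDs used by the draft model
        "target_devices": target_devices,  # GPU IDs used by the target model
        "use_logits": use_logits,          # False: accept by tokens, not logits
    }


config = parse_draft_target_model_config("[4,[0],[1],False]")
print(config)
```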

To benchmark throughput performance without speculative decoding, follow these steps:

# Run throughput benchmark for the 70B model without the draft model
 
trtllm-build --checkpoint_dir ./ckpt-target-70b --output_dir /data/70B-TRT/ \
    --gpt_attention_plugin float16 --workers 1 --max_batch_size 32 \
    --max_seq_len 131072 --use_fused_mlp enable --reduce_fusion enable \
    --use_paged_context_fmha enable --multiple_profiles enable --gemm_plugin fp8

python3 /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
    --output token-norm-dist.json --tokenizer /llama-3_3-70b/ token-norm-dist \
    --num-requests 1000 --input-mean 500 --input-stdev 0 --output-mean 200 \
    --output-stdev 0 > /tmp/synthetic.txt

trtllm-bench --model <path to target model repo> latency \
    --engine_dir /data/70B-TRT/ --dataset /tmp/synthetic.txt

Summary

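NVIDIA collaborates with Meta on the creation, optimization, and acceleration of the world's leading open models. NVIDIA supports Llama as part of our commitment to grow open community AI models and software, so that users can customize models and address their own unique workloads. NVIDIA is involved in several open-source projects through partnerships with developers, maintainers, and foundations.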

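NVIDIA TensorRT-LLM provides several features for optimizing and efficiently running LLMs of different model architectures. These optimizations deliver significant speedups on the same hardware, let fewer resources serve the same workload, reduce energy costs, and improve total cost of ownership. Available through production-ready deployments using NVIDIA NIM microservices, these TensorRT optimizations accelerate the deployment of generative AI applications across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations.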
