Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

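We're excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, which accelerates and optimizes LLM inference performance. You can try Llama 3 8B and Llama 3 70B, the first models in the series, right away through a browser user interface. Alternatively, you can use API endpoints running on a fully accelerated NVIDIA stack from the NVIDIA API catalog, where Llama 3 is packaged as an NVIDIA NIM with a standard API that can be deployed anywhere.

Large language models are computationally intensive. Their size makes them expensive and slow to run, especially without the right techniques. Many optimization techniques are available, from kernel fusion and quantization to runtime optimizations such as C++ implementations, KV caching, continuous in-flight batching, and paged attention. Developers must decide which combination fits their use case. TensorRT-LLM simplifies this work.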

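TensorRT-LLM is an open-source library that accelerates inference performance for the latest LLMs on NVIDIA GPUs. NeMo, an end-to-end framework for building, customizing, and deploying generative AI applications, uses TensorRT-LLM and NVIDIA Triton Inference Server for generative AI deployments.

TensorRT-LLM uses the NVIDIA TensorRT deep learning compiler. It includes the latest optimized kernels for cutting-edge implementations of FlashAttention and masked multi-head attention (MHA) for the context and generation phases of LLM execution. It also provides pre- and post-processing steps and multi-GPU/multi-node communication primitives in a simple, open-source Python API for fast LLM inference on GPUs.

To give you a feel for the library and how to use it, let's walk through an example of running and deploying Llama 3 8B with TensorRT-LLM and Triton Inference Server.

For other models, other optimizations, and multi-GPU execution, see the full list of TensorRT-LLM examples.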

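Getting started

We'll start by cloning the TensorRT-LLM library and installing it by following the OS-specific installation instructions that use the pip command. This is one of the easier ways to set up TensorRT-LLM. Alternatively, the library can be installed using a dockerfile that retrieves the dependencies.

The following commands pull the open-source library. The dependencies needed to install TensorRT-LLM are installed inside the container in a later step.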

git clone -b v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM

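Retrieving the model weights

TensorRT-LLM is a library for LLM inference. To use it, you must supply a set of trained weights. A set of weights can be pulled from repositories such as the Hugging Face Hub or NVIDIA NGC. Another option is to use your own model weights trained in a framework such as NeMo.

The commands in this post pull the weights (and tokenizer files) for the instruction-tuned variant of the 8-billion-parameter Llama 3 model from the Hugging Face Hub. You can also download the weights for offline use with the following commands and update the paths in later commands to point to this directory.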

git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

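Use of this model is subject to a specific license. To download the necessary files, agree to the terms and authenticate with HuggingFace.

Running the TensorRT-LLM container

We'll launch a base docker container and install the dependencies that TensorRT-LLM requires.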

# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install the stable version (corresponding to the cloned branch) of TensorRT-LLM.

pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com

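Compiling the model

The next step is to compile the model into a TensorRT engine, using the model weights and a model definition written with the TensorRT-LLM Python API.

The TensorRT-LLM repository contains several model architectures, and we use the Llama model definition. For more details, and for the more powerful plug-ins and quantization options that are available, see the Llama example and the precision documentation.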

# Log in to huggingface-cli
# You can get your token from huggingface.co/settings/tokens
huggingface-cli login --token *****

# Build the Llama 8B model using a single GPU and BF16.
python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir ./tmp/llama/8B/trt_engines/bf16/1-gpu \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16

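When we create the model definition with the TensorRT-LLM API, we build a graph of operations from NVIDIA TensorRT primitives that form the layers of the neural network. These operations map to specific kernels, which are prewritten programs for the GPU.

The TensorRT compiler can sweep through the graph to choose the best kernel for each operation and each available GPU. It can also identify patterns in the graph where multiple operations are good candidates for fusing into a single kernel. Fusion reduces the amount of memory movement required and the overhead of launching multiple GPU kernels.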
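To make the idea of kernel fusion concrete, here is a toy sketch in plain Python. It is not TensorRT code and the three operations (scale, add_bias, relu) are hypothetical stand-ins chosen for illustration: the unfused version makes three passes and materializes two intermediate buffers, while the fused version makes one pass and produces the same result.

```python
# Toy illustration of kernel fusion (plain Python, not TensorRT code).
# Each "kernel" reads its whole input and writes a whole output buffer.

def scale(xs, factor):
    return [x * factor for x in xs]

def add_bias(xs, bias):
    return [x + bias for x in xs]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def unfused(xs, factor, bias):
    # Three separate passes; two intermediate buffers are written and read back.
    return relu(add_bias(scale(xs, factor), bias))

def fused(xs, factor, bias):
    # One pass: each element is read once and written once.
    return [max(x * factor + bias, 0.0) for x in xs]

data = [-2.0, -0.5, 0.0, 1.5]
assert unfused(data, 2.0, 1.0) == fused(data, 2.0, 1.0)
print(fused(data, 2.0, 1.0))
```

On a GPU the saving comes from skipping the round trips to memory between the separate kernels and from launching one kernel instead of three.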

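TensorRT also builds the graph of operations into an NVIDIA CUDA graph that can be launched all at once. This further reduces the overhead of launching kernels.

The TensorRT compiler is efficient at fusing layers and increasing execution speed. However, some complex layer fusions, such as FlashAttention, interleave many operations and cannot be discovered automatically. For those, we can explicitly replace parts of the graph with plug-ins at compile time. In this example, we include the gpt_attention plug-in, which implements a FlashAttention-like fused attention kernel, and the gemm plug-in, which performs matrix multiplication with FP32 accumulation. We also set the precision for the full model to BF16, as specified by the bfloat16 flags in the build commands above, matching the precision of the weights downloaded from HuggingFace.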

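When the build script finishes, you should see the following files in the ./tmp/llama/8B/trt_engines/bf16/1-gpu folder:

rank0.engine is the main output of the build script. It contains the executable graph of operations with the model weights embedded.
config.json includes detailed information about the model, such as its general structure and precision, as well as which plug-ins were incorporated into the engine.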
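To see what the build recorded, you can inspect config.json without assuming its exact schema, which varies between TensorRT-LLM versions. The following is a minimal sketch: the helper names find_keys and summarize_engine_config are hypothetical, and the code only searches for keys whose names contain "plugin" or "dtype" rather than relying on any specific field.

```python
import json
from pathlib import Path

def find_keys(node, needle, prefix=""):
    """Yield (dotted_path, value) for every key whose name contains `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if needle in key.lower():
                yield path, value
            yield from find_keys(value, needle, path)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_keys(item, needle, f"{prefix}[{i}]")

def summarize_engine_config(engine_dir):
    """Load <engine_dir>/config.json and collect plug-in and dtype entries."""
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    return {
        "top_level_keys": sorted(config),
        "plugin_entries": dict(find_keys(config, "plugin")),
        "dtype_entries": dict(find_keys(config, "dtype")),
    }
```

For example, calling summarize_engine_config("./tmp/llama/8B/trt_engines/bf16/1-gpu") from the TensorRT-LLM directory should list whatever plug-in and precision settings the engine was built with.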

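Running the model

Now that we have a model engine, what can we do with it?

The engine file holds the information needed to execute the model. TensorRT-LLM includes a highly optimized C++ runtime for executing engine files and for managing processes such as sampling tokens from the model output, managing the KV cache, and batching requests together.

We can use the runtime directly to execute the model locally, or we can deploy it with Triton Inference Server in a production environment to share the model with multiple users.

To run the model locally, execute the following command.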

python3 examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/bf16/1-gpu --max_output_len 100 --tokenizer_dir ./Meta-Llama-3-8B-Instruct --input_text "How do I count to nine in French?"

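Deploying with Triton Inference Server

Beyond local execution, we can also use Triton Inference Server to create a production-ready deployment of our LLM. The Triton Inference Server backend for TensorRT-LLM uses the TensorRT-LLM C++ runtime for high-performance inference execution. It includes techniques such as in-flight batching and paged KV caching that provide high throughput at low latency. The TensorRT-LLM backend is bundled with Triton Inference Server and is available as a pre-built container on NGC.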
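To illustrate why in-flight batching helps throughput, the following toy simulation counts decoding steps for a queue of requests with different output lengths. It is a simplified sketch, not the scheduler used by the TensorRT-LLM backend: the function names are hypothetical, every step is assumed to produce one token per active request, and prompt processing is ignored.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def inflight_batching_steps(output_lens, batch_size):
    """A finished request frees its slot for the next queued request right away."""
    queue = deque(output_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decoding step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one step generates one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

lens = [2, 8, 3, 8, 2, 2]  # output lengths in tokens, each at least 1
print(static_batching_steps(lens, 2), inflight_batching_steps(lens, 2))
```

With two slots, the static schedule needs 18 steps because short requests wait for the longest request in their batch, while the in-flight schedule finishes in 13.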

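First, we must create a model repository so that Triton Inference Server can read the model and any associated metadata.

The tensorrtllm_backend repository includes the setup for a model repository under all_models/inflight_batcher_llm/ that we can use.

That directory contains four subfolders that hold artifacts for different parts of the model execution process. The preprocessing/ and postprocessing/ folders contain scripts for the Triton Inference Server python backend. These scripts tokenize the text inputs and detokenize the model outputs, converting between strings and the token IDs on which the model operates.

The tensorrt_llm folder is where we place the model engine compiled earlier. Finally, the ensemble folder defines a model ensemble that links the previous three components together and tells Triton Inference Server how to flow data through them.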

Pull down the example model repository and copy the model you compiled in the previous step into it.

# After exiting the TensorRT-LLM docker container
cd ..
git clone -b v0.8.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/bf16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/
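Before editing the configuration files, it can be worth confirming that the repository has the layout described above and that the engine was copied. The sketch below assumes the four folder names from this post and assumes each model folder carries a config.pbtxt, as the fill_template.py commands in the next step imply. The helper names are hypothetical.

```python
from pathlib import Path

# Folder names taken from the model repository description in this post.
EXPECTED_MODEL_DIRS = ("preprocessing", "postprocessing", "tensorrt_llm", "ensemble")

def missing_model_dirs(model_repo):
    """Return the expected model folders that lack a config.pbtxt file."""
    root = Path(model_repo)
    return [name for name in EXPECTED_MODEL_DIRS
            if not (root / name / "config.pbtxt").is_file()]

def engine_files(model_repo):
    """Return the names of the engine files copied into tensorrt_llm/1/."""
    version_dir = Path(model_repo) / "tensorrt_llm" / "1"
    return sorted(p.name for p in version_dir.glob("*.engine"))
```

Run from inside tensorrtllm_backend, missing_model_dirs("all_models/inflight_batcher_llm") should return an empty list, and engine_files should include rank0.engine for the single-GPU build.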

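Next, we must update the configuration files from the repository skeleton with the location of the compiled model engine. We must also set configuration parameters such as the tokenizer to use and how memory is allocated for the KV cache when batching requests for inference.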

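# Set the tokenizer_dir and engine_dir paths.
# Both are relative to the base working directory, which is mounted as
# /workspace when the Triton container is started below.
HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1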

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64

python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
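To get a feel for what max_tokens_in_paged_kv_cache:2560 means in memory terms, here is a rough back-of-the-envelope estimate. It assumes the published Llama 3 8B architecture (32 layers, 8 key/value heads with grouped-query attention, head size 128) and assumes the KV cache is stored in BF16 (2 bytes per value), which is not confirmed by the configuration above. The function name is hypothetical, and the runtime's real allocation also depends on block sizes and on kv_cache_free_gpu_mem_fraction.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head size 128; BF16 = 2 bytes.
per_token = kv_cache_bytes(1, 32, 8, 128, 2)
budget = kv_cache_bytes(2560, 32, 8, 128, 2)
print(per_token // 1024, "KiB per token")
print(budget // 2**20, "MiB for 2560 tokens")
```

Under these assumptions each cached token costs 128 KiB, so a 2560-token budget corresponds to about 320 MiB of KV cache.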

Now we can spin up the docker container and launch the Triton server. We must specify the world size (the number of GPUs the model was built for) and point to the model_repo that we just set up.

# Change to the base working directory
cd ..
docker run -it --rm --gpus all --network host --shm-size=1g \
-v $(pwd):/workspace \
--workdir /workspace \
nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3

# Log in to huggingface-cli to get tokenizer
huggingface-cli login --token *****

# Install python dependencies
pip install sentencepiece protobuf

# Launch Server

python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
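Loading the engine can take a while, so it can help to wait for the server to report readiness before sending requests. The sketch below polls Triton's standard HTTP health endpoint, /v2/health/ready, on port 8000, the same port the curl example in this post uses. The helper names, retry count, and delay are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request

def triton_ready(base_url="http://localhost:8000"):
    """Return True if GET /v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until(probe, attempts=30, delay_s=2.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None
```

Calling wait_until(triton_ready) from a second shell on the host returns the attempt number on which the server became ready, or None if it never did.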

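Sending requests

To send requests to the running server and interact with it, you can use one of the Triton Inference Server client libraries or send HTTP requests to the generate endpoint. The following curl command is a quick test that requests a completion from the running server. For a more fully featured way to communicate with the server, review the client script.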

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
"text_input": "How do I count to nine in French?",
"parameters": {
"max_tokens": 100,
"bad_words":[""],
"stop_words":[""]
}
}'
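The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl payload above. The function names are hypothetical, and the shape of the JSON response (for example, a text_output field) is an assumption to verify against your server.

```python
import json
import urllib.request

def build_generate_request(prompt, max_tokens=100,
                           url="http://localhost:8000/v2/models/ensemble/generate"):
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {
        "text_input": prompt,
        "parameters": {"max_tokens": max_tokens, "bad_words": [""], "stop_words": [""]},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, generate("How do I count to nine in French?") returns the parsed response as a dictionary.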

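Conclusion

TensorRT-LLM is a powerful tool for optimizing large language models and running them efficiently on NVIDIA GPUs. Triton Inference Server is a natural companion for deploying and efficiently serving large language models such as Llama 3.

Use this getting-started guide to begin working with Llama 3 and many other large language models using open-source tools.

NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorRT, will soon include TensorRT-LLM for mission-critical AI inference with enterprise-grade security, stability, manageability, and support.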

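Resources to get started

Access the TensorRT-LLM open-source library.
Learn more about the NVIDIA NeMo end-to-end framework.
Explore the developer documentation for TensorRT and TensorRT-LLM.
Explore sample code, benchmarks, and documentation on GitHub.
Experience NVIDIA NIM, which uses TensorRT-LLM for accelerated inference, at ai.nvidia.com.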
