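We are excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, which accelerates and optimizes LLM inference performance. You can immediately try Llama 3 8B and Llama 3 70B, the first models in the series, through a browser user interface. Alternatively, you can reach Llama 3 through API endpoints running on a fully accelerated NVIDIA stack from the NVIDIA API catalog, where it is packaged as an NVIDIA NIM with a standard API that can be deployed anywhere.
Large language models are computationally intensive. Their size makes them expensive and slow to run, especially without the right techniques. Many optimization techniques are available, ranging from kernel fusion and quantization to runtime optimizations such as C++ implementations, KV caching, continuous in-flight batching, and paged attention. Developers must decide which combination helps their use case. TensorRT-LLM simplifies this work.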
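TensorRT-LLM is an open-source library that accelerates inference performance for the latest LLMs on NVIDIA GPUs. NeMo, an end-to-end framework for building, customizing, and deploying generative AI applications, uses TensorRT-LLM and NVIDIA Triton Inference Server for generative AI deployments.
TensorRT-LLM uses the NVIDIA TensorRT deep learning compiler. It includes the latest optimized kernels for cutting-edge implementations of FlashAttention and masked multi-head attention (MHA) for LLM execution. It also provides pre- and post-processing steps and multi-GPU/multi-node communication primitives in a simple, open-source Python API for high LLM inference performance on GPUs.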
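To get a feel for the library and how to use it, this post walks through an example of how to use and deploy Llama 3 8B with TensorRT-LLM and Triton Inference Server.
For a more in-depth view, including other models, other optimizations, and multi-GPU execution, see the full list of TensorRT-LLM examples.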
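Getting started with installation
Start by cloning and building the TensorRT-LLM library, following the OS-specific installation instructions that use the pip command. This is one of the easier ways to build TensorRT-LLM. Alternatively, the library can be installed using a dockerfile to retrieve dependencies.
The following commands pull the open-source library so that the dependencies needed for TensorRT-LLM can be installed inside the container.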
git clone -b v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
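Retrieving the model weights
TensorRT-LLM is a library for LLM inference. To use it, you must supply a set of trained weights. A set of weights can be pulled from repositories such as the Hugging Face Hub or NVIDIA NGC. Another option is to use your own model weights trained in a framework such as NeMo.
The commands in this post automatically pull the weights (and tokenizer files) for the instruction-tuned variant of the 8-billion-parameter Llama 3 model from the Hugging Face Hub. You can also download the weights for offline use with the following commands and update the paths in later commands to point to this directory.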
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
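Note that using this model is subject to a particular license. Agree to the terms and authenticate with HuggingFace to download the necessary files.
Running the TensorRT-LLM container
Launch a base docker container and install the dependencies required by TensorRT-LLM.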
# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04
# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
# Install the stable version (corresponding to the cloned branch) of TensorRT-LLM.
pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com
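Compiling the model
The next step is to compile the model into a TensorRT engine, using the model weights and a model definition written with the TensorRT-LLM Python API.
The TensorRT-LLM repository contains several model architectures, and this post uses the Llama model definition. For more details, including more powerful plug-ins and the available quantization options, see the Llama example and the precision documentation.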
# Log in to huggingface-cli
# You can get your token from huggingface.co/settings/tokens
huggingface-cli login --token *****
# Build the Llama 8B model using a single GPU and BF16.
python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
--output_dir ./tllm_checkpoint_1gpu_bf16 \
--dtype bfloat16
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
--output_dir ./tmp/llama/8B/trt_engines/bf16/1-gpu \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16
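When you create the model definition with the TensorRT-LLM API, you build a graph of operations from NVIDIA TensorRT primitives that form the layers of the neural network. These operations map to specific kernels, which are prewritten programs for the GPU.
The TensorRT compiler can sweep through the graph to choose the best kernel for each operation and each available GPU. It can also identify patterns in the graph where multiple operations are good candidates for fusing into a single kernel. Fusion reduces the amount of memory movement required and the overhead of launching multiple GPU kernels.
Additionally, TensorRT builds the graph of operations into an NVIDIA CUDA Graph that can be launched all at once. This further reduces kernel launch overhead.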
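The TensorRT compiler is efficient at fusing layers and increasing execution speed. However, some complex layer fusions, such as FlashAttention, interleave many operations and cannot be discovered automatically. For those, you can explicitly replace parts of the graph with plug-ins at compile time. This example includes the gpt_attention plug-in, which implements a FlashAttention-like fused attention kernel, and the gemm plug-in, which performs matrix multiplication with FP32 accumulation. The build commands also set the precision of the full model to BF16 (--dtype bfloat16), matching the precision of the weights downloaded from HuggingFace.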
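To illustrate why a GEMM kernel would accumulate in FP32 even when its inputs are BF16, the following sketch emulates both precisions in plain Python. It is an illustration only and does not call TensorRT-LLM. The helper names `to_fp32`, `to_bf16`, and `dot` are hypothetical and introduced here for the example.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the upper 16 bits of the FP32 bit pattern.
    NaN handling is omitted; this sketch is meant for finite inputs.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, round_acc):
    """Dot product of BF16 inputs; round_acc sets the accumulator precision."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two BF16 values is exactly representable in FP32.
        product = to_bf16(x) * to_bf16(y)
        acc = round_acc(acc + product)
    return acc


ones = [1.0] * 1000
bf16_result = dot(ones, ones, to_bf16)  # accumulator rounded to BF16 each step
fp32_result = dot(ones, ones, to_fp32)  # accumulator rounded to FP32 each step
print(bf16_result, fp32_result)  # prints: 256.0 1000.0
```

With a BF16 accumulator the sum stalls at 256, because 257 is not representable with an 8-bit significand and rounds back to 256. The FP32 accumulator reaches the exact value.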
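After the build script finishes, you should find the following files in the ./tmp/llama/8B/trt_engines/bf16/1-gpu folder:
- rank0.engine is the main output of the build script. It contains the executable graph of operations with the model weights embedded.
- config.json includes detailed information about the model, such as its general structure and precision, as well as which plug-ins were incorporated into the engine.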
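A quick way to confirm that the build produced these files is to list the engine directory and look at the top-level keys of config.json. The sketch below assumes only that the folder holds one or more `*.engine` files and a JSON config. The function name `summarize_engine_dir` is hypothetical, and the exact keys in config.json depend on the TensorRT-LLM version, so the code reports them rather than expecting specific names.

```python
import json
from pathlib import Path


def summarize_engine_dir(engine_dir):
    """Return engine file sizes and the top-level keys of config.json."""
    engine_dir = Path(engine_dir)
    engines = {p.name: p.stat().st_size for p in sorted(engine_dir.glob("*.engine"))}
    config_keys = []
    config_path = engine_dir / "config.json"
    if config_path.is_file():
        with config_path.open() as f:
            config_keys = sorted(json.load(f).keys())
    return {"engines": engines, "config_keys": config_keys}


# Path taken from the build command in this post; skipped if not built yet.
default_dir = Path("./tmp/llama/8B/trt_engines/bf16/1-gpu")
if default_dir.is_dir():
    print(json.dumps(summarize_engine_dir(default_dir), indent=2))
```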
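Running the model
Now that you have a model engine, what can you do with it?
The engine file contains the information needed to execute the model. TensorRT-LLM includes a highly optimized C++ runtime for executing engine files and managing processes such as sampling tokens from the model output, managing the KV cache, and batching requests together.
You can use the runtime directly to execute the model locally, or you can deploy with Triton Inference Server in a production environment to share the model with multiple users.
To run the model locally, execute the following command.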
python3 examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/bf16/1-gpu --max_output_len 100 --tokenizer_dir ./Meta-Llama-3-8B-Instruct --input_text "How do I count to nine in French?"
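Deploying with Triton Inference Server
Beyond local execution, you can also use Triton Inference Server to create a production-ready deployment of your LLM. The Triton Inference Server backend for TensorRT-LLM uses the TensorRT-LLM C++ runtime for high-performance inference execution. It includes techniques such as in-flight batching and paged KV caching that provide high throughput at low latency. The TensorRT-LLM backend is bundled with Triton Inference Server and is available as a prebuilt container on NGC.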
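The benefit of in-flight batching can be shown with a toy step counter. The sketch below is not the TensorRT-LLM scheduler. It ignores the prompt-processing phase and memory limits, and it assumes every request generates at least one token. It only contrasts a static batch, which waits for its longest request, with a batch whose freed slots are refilled immediately. Both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def inflight_batching_steps(output_lens, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(output_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Fill free slots before the next decoding step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lens = [4, 1, 4, 1]  # output tokens per request, in arrival order
print(static_batching_steps(lens, batch_size=2))    # prints: 8
print(inflight_batching_steps(lens, batch_size=2))  # prints: 5
```

In this example the short requests no longer hold a slot idle while the long ones finish, so the same work completes in 5 decoding steps instead of 8.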
First, create a model repository so that Triton Inference Server can read the model and any associated metadata.
The tensorrtllm_backend repository includes a model repository skeleton under all_models/inflight_batcher_llm/ that you can replicate.
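This directory contains four subfolders that hold artifacts for different parts of the model execution process. The preprocessing/ and postprocessing/ folders contain scripts for the Triton Inference Server Python backend. These scripts tokenize the text inputs and detokenize the model outputs, converting between strings and the token IDs that the model operates on.
The tensorrt_llm folder is where you place the model engine that you compiled previously. Finally, the ensemble folder defines a model ensemble that links the previous three components together and tells Triton Inference Server how to flow data through them.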
Pull down the example model repository and copy the model that you compiled in the previous step into it.
# After exiting the TensorRT-LLM docker container
cd ..
git clone -b v0.8.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/bf16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/
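After copying the engine, it can help to confirm that the repository has the shape Triton expects: a config.pbtxt in each model folder and the engine under version directory 1 of tensorrt_llm. The following sketch checks the four folders described in this post. The function name `check_model_repo` is hypothetical, and the folder list is an assumption based on the text here, so adjust it if your copy of the repository contains additional models.

```python
from pathlib import Path

# Folder names as described in this post.
EXPECTED_MODELS = ("preprocessing", "postprocessing", "tensorrt_llm", "ensemble")


def check_model_repo(repo_dir):
    """Return a list of problems found in the model repository layout."""
    repo = Path(repo_dir)
    problems = []
    for name in EXPECTED_MODELS:
        if not (repo / name / "config.pbtxt").is_file():
            problems.append(f"missing {name}/config.pbtxt")
    if not list((repo / "tensorrt_llm" / "1").glob("*.engine")):
        problems.append("no .engine file in tensorrt_llm/1/")
    return problems


# Run from inside the tensorrtllm_backend checkout; skipped elsewhere.
repo = Path("all_models/inflight_batcher_llm")
if repo.is_dir():
    for line in check_model_repo(repo) or ["layout looks complete"]:
        print(line)
```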
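Next, modify the configuration files from the repository skeleton so that they point to the location of the compiled model engine. You must also update configuration parameters such as the tokenizer to use and how memory is allocated for the KV cache when running batches of inference requests.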
# Set the tokenizer_dir and engine_dir paths
HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
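Each of the commands above passes a comma-separated list of `key:value` pairs that are written into placeholders in a config.pbtxt file. The sketch below shows that idea with Python's `string.Template`, assuming placeholders of the form `${key}`. It is a minimal illustration, not the actual tools/fill_template.py script. The function names and the two-line example template are hypothetical.

```python
from string import Template


def parse_substitutions(spec):
    """Turn 'key:value,key:value' into a dict, splitting on the first colon only."""
    pairs = (item.split(":", 1) for item in spec.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}


def fill(template_text, spec):
    """Replace ${key} placeholders and leave unknown placeholders untouched."""
    return Template(template_text).safe_substitute(parse_substitutions(spec))


# Hypothetical two-line fragment standing in for a real config.pbtxt.
example = 'max_batch_size: ${triton_max_batch_size}\nstring_value: "${engine_dir}"\n'
filled = fill(example, "triton_max_batch_size:64,engine_dir:/engines/llama")
print(filled)
```

Because pairs are separated by commas, a value in this format cannot itself contain a comma. Splitting on the first colon only keeps values such as paths with colons intact.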
Now you can spin up the docker container and launch the Triton server. Specify the world size (the number of GPUs the model was built for) and point to the model_repo that you just set up.
# Change to base working directory
cd ..
docker run -it --rm --gpus all --network host --shm-size=1g \
-v $(pwd):/workspace \
--workdir /workspace \
nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3
# Log in to huggingface-cli to get tokenizer
huggingface-cli login --token *****
# Install python dependencies
pip install sentencepiece protobuf
# Launch Server
python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1
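Sending requests
To send requests to and interact with the running server, you can use one of the Triton Inference Server client libraries or send HTTP requests to the generate endpoint.
The following curl command demonstrates a quick test of requesting completions from the running server. For communicating with the server in more depth, review the more fully featured client script.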
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
"text_input": "How do I count to nine in French?",
"parameters": {
"max_tokens": 100,
"bad_words":[""],
"stop_words":[""]
}
}'
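The same request can be sent from Python using only the standard library. The sketch below mirrors the curl command above, with the same endpoint and JSON body. The function names `build_generate_request` and `generate` are hypothetical, and the 60-second timeout is an arbitrary choice.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this post.
GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"


def build_generate_request(prompt, max_tokens=100, url=GENERATE_URL):
    """Build a POST request with the same JSON body as the curl example."""
    payload = {
        "text_input": prompt,
        "parameters": {
            "max_tokens": max_tokens,
            "bad_words": [""],
            "stop_words": [""],
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `generate("How do I count to nine in French?")` returns the parsed response. The field names in that response depend on the backend version, so inspect the returned dictionary rather than relying on a particular key.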
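Conclusion
TensorRT-LLM provides tools for optimizing and efficiently running large language models on NVIDIA GPUs. Triton Inference Server is well suited to deploying and efficiently serving LLMs such as Llama 3.
Use this getting-started guide as a starting point for working with Llama 3 and the many other large language models available through the open-source community.
NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorRT, will soon include TensorRT-LLM for mission-critical AI inference with enterprise-grade security, stability, manageability, and support.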
Resources for getting started
- Access the TensorRT-LLM open-source library.
- Learn more about the NVIDIA NeMo framework.
- Explore the developer guides for TensorRT and TensorRT-LLM.
- Explore sample code, benchmarks, and documentation on GitHub.
- Learn more about NVIDIA NIM, which uses TensorRT-LLM for optimized inference, at ai.nvidia.com.