  • Generative AI

    Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

    Reading Time: 5 minutes

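    We're excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, which accelerates and optimizes LLM inference performance. You can try Llama 3 8B and Llama 3 70B, the first models in the series, right away through a browser user interface. Alternatively, you can reach Llama 3 through API endpoints running on a fully accelerated NVIDIA stack in the NVIDIA API catalog, where it is packaged as an NVIDIA NIM with a standard API.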

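    Large language models are computationally intensive. Their size makes them expensive and slow to run, especially without the right techniques. Many optimization techniques are available, from kernel fusion and quantization to runtime optimizations such as C++ implementations, KV caching, continuous in-flight batching, and paged attention. Developers must decide which combination suits their use case. TensorRT-LLM simplifies this work.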

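    TensorRT-LLM is an open-source library that accelerates inference performance for the latest LLMs on NVIDIA GPUs. NeMo, an end-to-end framework for building, customizing, and deploying generative AI applications, uses TensorRT-LLM and NVIDIA Triton Inference Server for generative AI deployments.

    TensorRT-LLM uses the NVIDIA TensorRT deep learning compiler. It includes the latest optimized kernels for cutting-edge implementations of FlashAttention and masked multi-head attention (MHA) for LLM execution. It also provides pre- and post-processing steps and multi-GPU/multi-node communication primitives in a simple, open-source Python API for high LLM inference performance on GPUs.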

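    To get a feel for the library and how to use it, let's walk through an example of deploying Llama 3 8B with TensorRT-LLM and Triton Inference Server.

    For a more in-depth view, including other models, other optimizations, and multi-GPU execution, see the full list of TensorRT-LLM examples.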

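    Getting started with installation

    Start by cloning and building the TensorRT-LLM library, following the OS-specific installation instructions that use the pip command. This is one of the easier ways to build TensorRT-LLM. Alternatively, the library can be installed using a Dockerfile that retrieves the dependencies.

    The following commands clone the open-source library at the v0.8.0 tag. The dependencies that TensorRT-LLM needs are installed later, inside the container.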

    git clone -b v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git
    cd TensorRT-LLM

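    Retrieving the model weights

    TensorRT-LLM is a library for LLM inference. To use it, you must supply a set of trained weights. Weights can be pulled from repositories such as Hugging Face Hub or NVIDIA NGC. Another option is to use your own model weights trained in a framework such as NeMo.

    The commands in this post pull the weights (and tokenizer files) for the instruction-tuned variant of the 8-billion-parameter Llama 3 model from Hugging Face Hub. You can also download the weights for offline use with the following commands and update the paths in later commands to point to this directory.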

    git lfs install
    git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

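    This model is subject to a specific license. To download the necessary files, agree to the terms and authenticate with Hugging Face.

    Running the TensorRT-LLM container

    Launch a base Docker container and install the dependencies that TensorRT-LLM requires.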

    # Obtain and start the basic docker image environment.
    docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04
    
    # Install dependencies, TensorRT-LLM requires Python 3.10
    apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
    
    # Install the stable version (corresponding to the cloned branch) of TensorRT-LLM.
    
    pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com

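    Compiling the model

    The next step is to compile the model into a TensorRT engine, using the model weights and a model definition written with the TensorRT-LLM Python API.

    The TensorRT-LLM repository includes several model architectures; this post uses the Llama model definition. For more details, and for the more powerful plugins and quantization options available, see the Llama example and the precision documentation.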

    # Log in to huggingface-cli
    # You can get your token from huggingface.co/settings/tokens
    huggingface-cli login --token *****
    
    # Build the Llama 8B model using a single GPU and BF16.
    python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
                --output_dir ./tllm_checkpoint_1gpu_bf16 \
                --dtype bfloat16
    
    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
                --output_dir ./tmp/llama/8B/trt_engines/bf16/1-gpu \
                --gpt_attention_plugin bfloat16 \
                --gemm_plugin bfloat16

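    When you create the model definition with the TensorRT-LLM API, you build a graph of operations from NVIDIA TensorRT primitives that form the layers of the neural network. These operations map to specific kernels, which are prewritten programs for the GPU.

    The TensorRT compiler can sweep through the graph to choose the best kernel for each operation and each available GPU. It can also identify patterns in the graph where multiple operations are good candidates for fusing into a single kernel. Fusion reduces the amount of memory movement required and the overhead of launching multiple GPU kernels.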
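
    To make the idea of fusion concrete, here is a toy sketch in plain Python. It is not TensorRT code, and the function names are hypothetical. The unfused version runs three separate passes and materializes two intermediate buffers, the way three separate kernels would; the fused version computes the same result in one pass.

```python
# Toy illustration of operation fusion in plain Python (not TensorRT code).

def unfused(xs, scale, bias):
    """Three separate 'kernels': each one reads and writes a full buffer."""
    scaled = [x * scale for x in xs]        # kernel 1: multiply
    shifted = [s + bias for s in scaled]    # kernel 2: add
    return [max(v, 0.0) for v in shifted]   # kernel 3: ReLU


def fused(xs, scale, bias):
    """One 'kernel': a single pass with no intermediate buffers."""
    return [max(x * scale + bias, 0.0) for x in xs]


data = [-2.0, -0.5, 0.0, 1.5]
# Both versions apply the same arithmetic in the same order.
assert unfused(data, 2.0, 1.0) == fused(data, 2.0, 1.0)
print(fused(data, 2.0, 1.0))
```

    On a GPU the saving comes from skipping the round trips to memory for the intermediate buffers and from launching one kernel instead of three.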

    TensorRT also builds the graph of operations into an NVIDIA CUDA graph that can be launched all at once, which further reduces kernel launch overhead.

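    The TensorRT compiler is efficient at fusing layers and increasing execution speed, but some complex layer fusions, such as FlashAttention, interleave many operations and cannot be discovered automatically. For those, you can explicitly replace parts of the graph with plugins at compile time. This example includes the gpt_attention plugin, which implements a FlashAttention-like fused attention kernel, and the gemm plugin, which performs matrix multiplication with FP32 accumulation. It also sets the precision for the full model to BF16 (the bfloat16 value passed to --dtype and to both plugin flags), matching the precision of the weights downloaded from Hugging Face.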
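
    The reason a matrix multiplication accumulates in FP32 even when its inputs are 16-bit can be seen with a small experiment. The sketch below is an illustration only, not how the gemm plugin is implemented. It uses IEEE half precision (the struct format code "e") as a stand-in for a 16-bit type, and the helper names are hypothetical.

```python
import struct


def round_to(fmt, x):
    """Round a Python float to the precision named by a struct format code."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def dot(a, b, acc_fmt):
    """Dot product of half-precision inputs with a chosen accumulator precision."""
    total = 0.0
    for x, y in zip(a, b):
        product = round_to("e", x) * round_to("e", y)
        # The running sum is rounded to the accumulator precision after each add.
        total = round_to(acc_fmt, total + product)
    return total


a = [0.1] * 4096
b = [1.0] * 4096
print("half accumulator:  ", dot(a, b, "e"))
print("single accumulator:", dot(a, b, "f"))
```

    With a half-precision accumulator the running sum stops growing once each new product is smaller than half the spacing between representable values, so the result falls far short of the single-precision one.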

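    When the build script finishes, you should see the following files in the ./tmp/llama/8B/trt_engines/bf16/1-gpu folder:

    • rank0.engine is the main output of the build script. It contains the executable graph of operations with the model weights embedded.
    • config.json contains detailed information about the model, such as its general structure and precision, as well as which plugins were incorporated into the engine.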
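
    A quick way to confirm the build output is to list the engine files and look at the top-level keys of config.json. The sketch below is a hypothetical helper (summarize_engine_dir is not part of TensorRT-LLM) and makes no assumption about which keys the file contains.

```python
import json
from pathlib import Path


def summarize_engine_dir(engine_dir):
    """Return the engine files found and the top-level keys of config.json."""
    engine_dir = Path(engine_dir)
    engines = sorted(p.name for p in engine_dir.glob("*.engine"))
    config_path = engine_dir / "config.json"
    # An absent config.json is reported as an empty key list rather than an error.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    return {"engines": engines, "config_keys": sorted(config)}


if __name__ == "__main__":
    # Path taken from the trtllm-build command in this post.
    print(summarize_engine_dir("./tmp/llama/8B/trt_engines/bf16/1-gpu"))
```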

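    Running the model

    Now that you have a model engine, what can you do with it?

    The engine file contains the information needed to execute the model. TensorRT-LLM includes a highly optimized C++ runtime for executing engine files and managing processes such as sampling tokens from the model output, managing the KV cache, and batching requests together.

    You can use the runtime directly to run the model locally, or deploy it with Triton Inference Server in a production environment to share the model with multiple users.

    To run the model locally, execute the following command: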

    python3 examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/bf16/1-gpu --max_output_len 100 --tokenizer_dir ./Meta-Llama-3-8B-Instruct --input_text "How do I count to nine in French?"

    Deploying with Triton Inference Server

    Beyond local execution, you can use Triton Inference Server to create a production-ready deployment of your LLM. The Triton Inference Server backend for TensorRT-LLM uses the TensorRT-LLM C++ runtime for high-performance inference. It includes techniques such as in-flight batching and paged KV caching, which provide high throughput at low latency. The TensorRT-LLM backend is bundled with Triton Inference Server and is available as a prebuilt container on NGC.
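
    The throughput benefit of in-flight batching can be illustrated with a toy simulation. The sketch below is a simplified model, not the scheduler used by the backend: each request is reduced to the number of tokens it must generate, one decoding step produces one token per active request, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request at once."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        # Fill free slots from the queue before each decoding step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [2, 8, 3, 7, 1, 6]
print("static:   ", static_batching_steps(lengths, 2))
print("in-flight:", inflight_batching_steps(lengths, 2))
```

    In the static case short requests leave their slot idle until the longest request in the batch finishes, which is the waste that in-flight batching removes.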

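    First, create a model repository so that Triton Inference Server can read the model and its associated metadata.

    The tensorrtllm_backend repository includes a model repository skeleton under all_models/inflight_batcher_llm/.

    That directory contains four subfolders used in this walkthrough, each holding artifacts for a different part of the model execution process. The preprocessing/ and postprocessing/ folders contain scripts for the Triton Inference Server Python backend. These scripts tokenize the text inputs and detokenize the model outputs, converting between strings and the token IDs that the model operates on.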
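
    The role of the two folders can be pictured as a round trip between text and token IDs. The sketch below uses a hypothetical six-word vocabulary and whitespace splitting purely for illustration; the real scripts load the Llama 3 tokenizer files instead.

```python
# Hypothetical toy vocabulary; a real deployment uses the Llama 3 tokenizer.
VOCAB = {"<unk>": 0, "how": 1, "do": 2, "i": 3, "count": 4, "to": 5, "nine": 6}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def preprocess(text):
    """Text to token IDs, the job of the preprocessing/ model."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def postprocess(token_ids):
    """Token IDs back to text, the job of the postprocessing/ model."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


ids = preprocess("How do I count to nine")
print(ids)
print(postprocess(ids))
```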

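    The tensorrt_llm folder is where you place the model engine compiled earlier. Finally, the ensemble folder defines a model ensemble that links the previous three components together and tells Triton Inference Server how data flows through them.

    Pull down the example model repository and copy the model you compiled in the previous step into it.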

    # After exiting the TensorRT-LLM docker container
    cd ..
    git clone -b v0.8.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
    cd tensorrtllm_backend
    cp ../TensorRT-LLM/tmp/llama/8B/trt_engines/bf16/1-gpu/* all_models/inflight_batcher_llm/tensorrt_llm/1/

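    Next, update the configuration files from the repository skeleton with the location of the compiled model engine. You must also set configuration parameters such as the tokenizer to use, the in-flight batching strategy, and the memory allocated to the KV cache when batching requests for inference.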

    # Set the tokenizer_dir and engine_dir paths, relative to the base working directory mounted into the Triton container
    HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
    ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
    
    python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
    
    python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
    
    python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
    
    python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
    
    python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
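
    To get a feel for what max_tokens_in_paged_kv_cache:2560 means in memory terms, the following back-of-the-envelope sketch estimates the KV cache size. The architecture values (32 layers, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama 3 8B model configuration, and 2 bytes per element corresponds to BF16. The function name is hypothetical, and the estimate ignores any allocator overhead.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Approximate KV cache size for a given number of cached tokens."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Assumed Llama 3 8B values: 32 layers, 8 KV heads, head_dim 128, BF16.
llama3_8b = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)

per_token = kv_cache_bytes(1, **llama3_8b)
total = kv_cache_bytes(2560, **llama3_8b)
print(per_token // 1024, "KiB per token")
print(total // 2**20, "MiB for 2560 tokens")
```

    Under these assumptions the cache for 2560 tokens is small compared with the model weights, which is why larger values are common when GPU memory allows.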

    Now you can start the Docker container and launch the Triton server. You must specify the world size (the number of GPUs the model was built for) and point to the model_repo you just set up.

    # Change to base working directory
    cd ..
    docker run -it --rm --gpus all --network host --shm-size=1g \
    -v $(pwd):/workspace \
    --workdir /workspace \
    nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3
    
    # Log in to huggingface-cli to get tokenizer
    huggingface-cli login --token *****
    
    # Install python dependencies
    pip install sentencepiece protobuf
    
    # Launch Server
    
    python3 tensorrtllm_backend/scripts/launch_triton_server.py --model_repo tensorrtllm_backend/all_models/inflight_batcher_llm --world_size 1

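    Sending requests

    To send inference requests to the server and interact with it, you can use one of the Triton Inference Server client libraries or send HTTP requests to the generated endpoint.
    The following curl command is a quick test that requests a completion from the running server. For more complete communication with the server, review the more fully featured client script.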

    curl -X POST localhost:8000/v2/models/ensemble/generate -d \
    '{
    "text_input": "How do I count to nine in French?",
    "parameters": {
    "max_tokens": 100,
    "bad_words":[""],
    "stop_words":[""]
    }
    }'
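
    The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl payload above; the helper names are hypothetical, and reading the reply from a text_output field is an assumption about the ensemble's response format that you should check against your server's actual output.

```python
import json
import urllib.request

# Endpoint taken from the curl command in this post.
DEFAULT_URL = "http://localhost:8000/v2/models/ensemble/generate"


def build_generate_request(prompt, max_tokens=100, url=DEFAULT_URL):
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "text_input": prompt,
        "parameters": {
            "max_tokens": max_tokens,
            "bad_words": [""],
            "stop_words": [""],
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        # Assumption: the reply is JSON with the text in "text_output".
        return json.loads(response.read())["text_output"]


# Example, with the Triton server from the previous step running:
# print(generate("How do I count to nine in French?"))
```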

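    Conclusion

    TensorRT-LLM is a powerful tool for optimizing and efficiently running large language models on NVIDIA GPUs. Triton Inference Server is well suited to deploying and efficiently serving large language models such as Llama 3.

    Use this getting-started guide to begin working with Llama 3 and many other large language models.

    NVIDIA AI Enterprise, an end-to-end AI software platform that includes TensorRT, will soon include TensorRT-LLM for mission-critical AI inference with enterprise-grade security, stability, manageability, and support.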

