    Generative AI

    Spotlight: Optimizing NAVER Place's SLM Vertical Services with NVIDIA TensorRT-LLM

    Reading Time: 7 minutes

    NAVER Place builds SLM vertical services specialized for the Place domain and applies them across several of its service areas.

    In this post, NVIDIA and NAVER share how NAVER's SLM vertical services were optimized with the TensorRT-LLM library and deployed with Triton Inference Server, along with what was learned in the process. To learn more about the team behind this work, see Introduction to NAVER Place AI Development Team.

    An SLM is a language model that is smaller and lighter than a large language model (LLM), with far fewer parameters, giving it an advantage in speed, memory use, and cost. Through fine-tuning, an SLM can absorb domain-specific data and deliver strong performance on specialized tasks at a fraction of the cost of an LLM.

    NAVER Place fine-tunes SLMs on large in-house datasets accumulated while operating the service, and deploys the resulting specialized models to tasks across NAVER Place where general-purpose models fall short in quality or cost.

    Figure 1 shows the user-facing UI of two of these SLM-powered tasks.
    Figure 1: UI of two SLM-powered features in NAVER Place (image credit: NAVER)

    POI matching with an SLM transformer decoder

    NAVER Place continually registers and verifies places using data gathered from a variety of sources, such as credit card transaction records and receipt images. A core step in this pipeline is POI (Point of Interest) matching: deciding which place a given record refers to. Much of the incoming data describes a POI only partially or noisily, which makes it hard to match against millions of candidate POIs with conventional methods. The team therefore pairs a retrieval stage with an SLM transformer decoder, so that even noisy inputs can be matched to the correct POI with high accuracy.

    The POI matching task takes credit card transaction records or receipt images as input and determines which point of interest they refer to.
    Figure 2: UI of the POI matching feature powered by an SLM decoder (image credit: NAVER)

    Adopting NVIDIA TensorRT-LLM for inference optimization

    TensorRT-LLM is an LLM inference library that compiles models into runtime engines optimized for NVIDIA GPUs. It raises throughput with in-flight batching, and for auto-regressive generation it manages memory efficiently through features such as paged KV cache and chunked context, sustaining high performance even under heavy load.
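    For orientation, recent TensorRT-LLM releases also expose a high-level Python LLM API that builds an optimized engine from a Hugging Face checkpoint and serves generation with these features active by default. The following is a minimal sketch, not taken from NAVER's setup, assuming a recent tensorrt_llm version and an illustrative model name:

    from tensorrt_llm import LLM, SamplingParams

    # Builds (or loads) a TensorRT engine for the checkpoint and serves it
    # with in-flight batching and paged KV cache enabled by default.
    llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # illustrative model name

    outputs = llm.generate(
        ["Summarize the reviews for this restaurant: ..."],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)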

    Before adopting a library, the team benchmarked candidate LLM inference frameworks on metrics such as throughput, TTFT (Time to First Token), and TPOT (Time Per Output Token) under workloads resembling each service. The chart below compares the throughput of TensorRT-LLM with an alternative open-source LLM inference library for Qwen models across several input and output token lengths on A100 and H100 GPUs.

    Figure 3: QPS of TensorRT-LLM versus an alternative open-source inference library for 7B and 72B Qwen models on A100 and H100, across four workload types: decode-prefill light, prefill heavy, decode heavy, and decode-prefill heavy (image credit: NAVER)

    TensorRT-LLM delivered strong results across all four workload types: decode-prefill light, prefill heavy, decode heavy, and decode-prefill heavy. Our SLM services are mostly decode heavy, and there TensorRT-LLM showed a clear advantage, especially on the NVIDIA Hopper architecture. For guidance on extracting further performance, see the performance overview in the NVIDIA/TensorRT-LLM GitHub and Best practices for tuning the performance of TensorRT-LLM engines.

    Inference optimization: the throughput/latency trade-off

    This section describes how we tuned batch size, paged KV cache, and in-flight batching, the settings that govern memory use, to meet the distinct requirements of each LLM service.

    Batch size

    In LLM serving, raising the batch size improves throughput but also raises per-request latency, a direct trade-off. The batch size therefore has to be chosen while monitoring TTFT and TPOT against each service's requirements.
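    One way to run such a sweep from the client side, sketched here with a hypothetical streaming client (stream_generate, which yields tokens as they arrive) rather than any specific library:

    import time

    def measure_request(stream_generate, prompt):
        # Measures TTFT and TPOT for a single streamed request.
        start = time.perf_counter()
        token_times = []
        for _ in stream_generate(prompt):  # hypothetical streaming client
            token_times.append(time.perf_counter())
        ttft = token_times[0] - start
        # TPOT: mean interval between tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
        return ttft, tpot

    Repeating the measurement while raising client concurrency (which drives the effective batch size on the server) shows where TTFT and TPOT cross each service's latency budget.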

    Services that need low latency therefore run with a much smaller batch (upper part of the figure below), while services that aim for maximum throughput run at the upper-bound batch size (lower part of the figure).
    Figure 4: Throughput and per-output-token latency (TPOT) as batch size grows (image credit: NAVER)

    Paged KV cache and in-flight batching

    TensorRT-LLM enables paged KV cache by default. By allocating the KV cache in fixed-size blocks, it cuts memory waste and allows larger batches, which benefits throughput-oriented tasks; for latency-critical tasks, however, it is not always the best choice.

    In-flight batching, also enabled by default, continuously admits new requests into the running batch as earlier sequences finish, raising utilization and throughput; it suits tasks that serve many concurrent requests.

    However, when a small model serves light traffic, these features add overhead without improving latency, and it can be better to turn them off. For example, the POI matching service must respond with low latency, but its model is a comparatively small 1.3B model, and to use GPU resources efficiently it runs on low-end T4 GPUs. Fixing the batch size to 1 gave the lowest latency there, so we turned the batching option off.

    In addition, for a small 1.3B model running at batch size 1, the paging overhead outweighed the compute savings, degrading both latency and QPS. Because the batch size is fixed at 1, the memory overhead that paging is designed to mitigate is no longer a concern, so we disabled paged KV cache and used a contiguous KV cache instead.

    precision | paged KV cache | cache blocks | input/output len | max batch size | QPS  | latency (secs)
    fp16      | on             | 7110         | 500/5            | 1              | 6.49 | 0.154
    fp16      | off            | 7110         | 500/5            | 1              | 8.39 | 0.119
    Table 1. Results on a low-spec GPU with a small model and a fixed batch size: disabling paged KV cache improves both QPS and latency.

    Because the POI matching service has both a latency-critical real-time path and a throughput-oriented background batch path, we build two engines with different configurations, shown below: the first, with max_batch_size 1 and paged_kv_cache disabled, serves the real-time path; the second, with max_batch_size 8 and paged_kv_cache enabled, serves the batch path.

    "build_config": {
            "max_input_len": 512,
            "max_output_len": 32,
            "max_batch_size": 1,
            "max_beam_width": 1,
            "max_num_tokens": 4096,
            ...
            "plugin_config": {
                ...
                "paged_kv_cache": false,
                ...
            }
        }
    "build_config": { 
            "max_input_len": 512, 
            "max_output_len": 32, 
            "max_batch_size": 8, 
            "max_beam_width": 1, 
            "max_num_tokens": 4096, 
            ... 
            "plugin_config": { 
                ... 
     
                "paged_kv_cache": true, 
                ... 
            } 
        } 

    Inference optimization: downstream caching

    This section describes the caching strategies we applied in downstream services to avoid recomputing identical work. Both prefix caching and response caching reduce redundant computation and improve response time.

    Prefix caching 

    Prompts in a downstream task often share a common prefix; by reusing the KV cache computed for that prefix, repeated prefill computation can be skipped. TensorRT-LLM supports this through its prefix caching (KV cache reuse) feature: reused blocks are kept in memory and shared across requests. For details, see how to enable KV cache reuse in the TensorRT-LLM GitHub.
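    In TensorRT-LLM's Python LLM API, block reuse is exposed as a KV cache option. The sketch below assumes a recent tensorrt_llm version and an illustrative model; when serving through the Triton TensorRT-LLM backend, the corresponding switch is the enable_kv_cache_reuse model parameter, and the linked document covers the engine build requirements:

    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.llmapi import KvCacheConfig

    # enable_block_reuse lets requests that share a prompt prefix reuse the
    # KV cache blocks computed for that prefix instead of re-running prefill.
    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",  # illustrative model name
        kv_cache_config=KvCacheConfig(enable_block_reuse=True),
    )

    system_prompt = "You are a POI matching assistant. ..."  # long shared prefix
    for record in ["receipt text 1", "receipt text 2"]:
        out = llm.generate([system_prompt + record], SamplingParams(max_tokens=32))
        print(out[0].outputs[0].text)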

    Prefix caching is especially effective for tasks where TTFT dominates overall latency and where a long shared system prompt makes up most of the input. For example, one of our review-processing pipelines runs roughly 40 multi-step inference calls per item, and reusing the prefix at every step produced a significant speedup.

    Note that when tasks with different system prompts share the same engine, cached blocks are evicted in LRU order, so contention between tasks can lower the cache hit rate; this should be factored into deployment decisions.

    Response caching 

    Using the response caching feature of Triton server, duplicate requests can be handled efficiently.

    In production, the same request often arrives multiple times. For models that do not use multinomial sampling decoding, and therefore always return the same output for the same input, response caching can be applied safely. In the POI matching service currently in production, 4-5% of requests are served from the cache, which reduced overall processing cost by 17%. For details, see the Triton Response Cache documentation.

    Figure 5: Response cache hit metrics for the POI matching service, as shown in Grafana (image credit: NAVER)
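    Response caching is a server-side feature, enabled with Triton's --cache-config server flag and the response_cache field of the model configuration, so clients need no changes. A sketch of the client's view, with hypothetical model and tensor names:

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    def match(text):
        # Hypothetical input tensor name for the POI matching model.
        inp = httpclient.InferInput("text_input", [1, 1], "BYTES")
        inp.set_data_from_numpy(np.array([[text.encode()]], dtype=np.object_))
        result = client.infer("poi_matching", inputs=[inp])
        return result.as_numpy("text_output")

    match("CARD PAYMENT 12,000 KRW ...")  # computed by the model
    match("CARD PAYMENT 12,000 KRW ...")  # identical request: may hit the cache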

    Serving TensorRT-LLM with Triton

    We deploy the SLM models optimized with TensorRT-LLM to production using NVIDIA Triton Inference Server. To handle surrounding logic such as tokenizing, postprocessing, and multi-step orchestration, Triton offers Ensemble models and BLS (Business Logic Scripting). Of the two, we chose BLS for its greater flexibility. The rest of this section describes the problems we ran into while using Triton BLS and how we solved them.

    Explicit management of request/response schemas between models

    Triton models exchange data through pb_tensor objects. A BLS script that orchestrates several models, including the LLM, must therefore convert pb_tensor inputs to NumPy arrays, process them, and convert the results back to pb_tensor at every step.

    This process causes two problems. First, because the input/output schema of each model is not explicitly declared anywhere, schema mismatches surface only as runtime errors that are hard to trace. Second, as the BLS pipeline grows more complex, the same conversion code is repeated at every step, hurting readability and maintainability. The POI matching pipeline below illustrates the scale of the problem.

    Figure 6: The multi-step BLS pipeline of the POI matching service (image credit: NAVER)

    The BLS-based POI matching pipeline is a multi-step process: embedding the OCR output (tokenization, BERT encoder), candidate retrieval (embedding-based retrieval), re-ranking (tokenizer and reranker model), and final answer generation (generator encoder and decoder). Converting between pb_tensor and NumPy formats at every one of these steps made the code verbose and error-prone, dragging down productivity and code quality. The following sections describe how we addressed this.

    IO schema definition and validation

    Rather than the plain Python dataclass pattern that NVIDIA provides in its examples, we adopted Pydantic, which adds data validation on top of type declarations. We defined an explicit request/response schema for every Triton model so that invalid data is rejected at the boundary instead of failing deep inside the pipeline.

    For example, the BlsRequest schema below defines the input contract of the BLS model and validates that at least one field is populated.

    from typing import List, Optional

    from pydantic import Field, root_validator

    # NOTE: Because Triton passes data as pb_tensor and NumPy objects,
    # fields that are not plain Python types must be declared explicitly.
    # We use Pydantic's json_schema_extra to record each field's Triton data type.
    class BlsRequest(TritonFieldModel):
        name: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        subname: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        biznum: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        address: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        tel: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})

        @root_validator(pre=True)
        def check_all_fields_empty(cls, values):
            # Reject requests in which every field is empty.
            if not any(bool(v) for v in values.values()):
                raise ValueError("All fields cannot be empty")
            return values
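    A quick usage sketch of the schema above, with hypothetical field values:

    # A valid request: at least one field is populated.
    req = BlsRequest(name="Sample Cafe", address=["123 Example-ro, Seoul"])

    # An invalid request: every field is empty, so the root validator raises.
    try:
        BlsRequest()
    except ValueError as err:  # pydantic wraps it in a ValidationError
        print(err)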

    Shared IO type conversion logic

    To eliminate the conversion code repeated in every model, we moved the pb_tensor/Pydantic conversion logic into a base Triton Python model that all models inherit. Each model then only declares its schema and business logic, which keeps development focused and the code consistent.

    The following is an example. This method converts a Pydantic request into Triton pb_tensors, runs inference, and converts the result back into a Pydantic response.

    def _infer_model(self, request, model_name, request_model, response_model_cls, **infer_kwargs):
         # Converts the Pydantic request into Triton pb_tensors.
         pb_tensors = self.convert_pydantic_to_pb_tensors(request_model)
         # Runs model inference through BLS.
         infer_request = pb_utils.InferenceRequest(
             model_name=model_name,
             inputs=pb_tensors,
             requested_output_names=response_model_cls.get_field_names(),
             **infer_kwargs,
         )
         infer_response = infer_request.exec()
         # Converts the Triton response (pb_tensors) back into a Pydantic response.
         return self.convert_pb_tensors_to_pydantic(infer_response, response_model_cls)
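    A minimal sketch of what the shared conversion helpers might look like, assuming string-typed fields carried as BYTES tensors and the standard triton_python_backend_utils API; get_field_names is the schema helper used above:

    import numpy as np
    import triton_python_backend_utils as pb_utils

    class TritonBaseModel:
        def convert_pydantic_to_pb_tensors(self, request_model):
            # One pb_utils.Tensor per populated Pydantic field.
            tensors = []
            for name, value in request_model.dict(exclude_none=True).items():
                values = value if isinstance(value, list) else [value]
                array = np.array([str(v).encode("utf-8") for v in values], dtype=np.object_)
                tensors.append(pb_utils.Tensor(name, array.reshape(1, -1)))
            return tensors

        def convert_pb_tensors_to_pydantic(self, infer_response, response_model_cls):
            # Reads each declared field back out of the response by name.
            fields = {}
            for name in response_model_cls.get_field_names():
                tensor = pb_utils.get_output_tensor_by_name(infer_response, name)
                if tensor is not None:
                    value = [v.decode("utf-8") for v in tensor.as_numpy().flatten()]
                    fields[name] = value if len(value) > 1 else value[0]
            return response_model_cls(**fields)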

    The example below wraps _infer_model for a specific model. A developer only specifies the GeneratorRequest and GeneratorResponse schemas; all conversion is handled by the shared base logic, so the call site stays short and readable.

    def infer_generator(self, request, text_input, max_tokens):
          response_model_cls = schema.GeneratorResponse
          request_model = schema.GeneratorRequest(text_input=text_input, max_tokens=max_tokens)
          return self._infer_model(
              request=request,
              model_name="generator_bls",
              request_model=request_model,
              response_model_cls=response_model_cls,
          )

    Separating BLS business logic and improving testability

    As the business logic inside BLS grew, it became hard to test because it was entangled with the Triton runtime. We solved this by restructuring the code and the test setup as follows.

    • Structuring the code so business logic is separable (see the sketch after this list): 
      • Inference calls and business logic live in separate classes and modules, so each part can be developed and verified independently. 
      • The business logic depends only on a plain Python runtime, not the Triton runtime, so it can be unit tested in any local environment. 
    • Testing the BLS layer:  
      • Tests for the BLS layer itself run in an E2E test environment; by mocking the model inference calls, the orchestration logic of the BLS can be verified without live models. 
    • CI setup: 
      • Unit and integration tests run automatically in CI whenever the code changes. 
      • As a result, regressions are caught early, and quality stays high even as the business logic evolves quickly. 
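    A minimal sketch of that separation, with hypothetical names (match_poi as the pure business logic, FakeInferer standing in for the Triton-backed inference client):

    # Business logic depends only on an inferer interface, not on the Triton
    # runtime, so it can be unit tested in any plain Python environment.
    def match_poi(inferer, ocr_text):
        candidates = inferer.retrieve(ocr_text)
        return inferer.generate(ocr_text, candidates)

    class FakeInferer:
        def retrieve(self, text):
            return ["candidate-poi-1", "candidate-poi-2"]

        def generate(self, text, candidates):
            return candidates[0]

    def test_match_poi_returns_top_candidate():
        assert match_poi(FakeInferer(), "SAMPLE RECEIPT") == "candidate-poi-1"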

    With explicit schema management, shared conversion logic, separated business logic, and automated tests in place, the team now iterates on Triton-based LLM services quickly and reliably.

    Summary

    NAVER Place used NVIDIA TensorRT-LLM to optimize its LLM serving and deployed the optimized models with NVIDIA Triton Inference Server. Thanks to these optimizations, the services meet their performance requirements while saving a significant amount of GPU resources. This made it possible to expand SLM-based vertical services and improve quality across NAVER Place, and the team plans to keep applying the approach as it launches additional vertical services.
