    Generative AI

    Spotlight: Optimizing NAVER Place's SLM Vertical Services with NVIDIA TensorRT-LLM

    Reading Time: 7 minutes

    NAVER Place builds SLM vertical services specialized for the Place domain and applies them across several of its service areas.

    In this post, NVIDIA and NAVER share how NAVER's SLM vertical services were optimized with the TensorRT-LLM library and deployed with Triton Inference Server, along with what was learned in the process. To learn more about the team behind this work, see Introduction to NAVER Place AI Development Team.

    An SLM is a language model that is smaller and lighter than a large language model (LLM), with far fewer parameters, giving it an advantage in speed, memory use, and cost. Through fine-tuning, an SLM can absorb domain-specific data and deliver strong performance on specialized tasks at a fraction of the cost of an LLM.

    NAVER Place fine-tunes SLMs on large in-house datasets accumulated while operating the service, and deploys the resulting specialized models to tasks across NAVER Place where general-purpose models fall short in quality or cost.

    Figure 1 shows the user-facing UI of two of these SLM-powered tasks.
    Figure 1: UI of two SLM-powered features in NAVER Place (image credit: NAVER)

    POI matching with an SLM transformer decoder

    NAVER Place continually registers and verifies places using data gathered from a variety of sources, such as credit card transaction records and receipt images. A core step in this pipeline is POI (Point of Interest) matching: deciding which place a given record refers to. Much of the incoming data describes a POI only partially or noisily, which makes it hard to match against millions of candidate POIs with conventional methods. The team therefore pairs a retrieval stage with an SLM transformer decoder, so that even noisy inputs can be matched to the correct POI with high accuracy.

    The POI matching task takes credit card transaction records or receipt images as input and determines which point of interest they refer to.
    Figure 2: UI of the POI matching feature powered by an SLM decoder (image credit: NAVER)

    Adopting NVIDIA TensorRT-LLM for inference optimization

    TensorRT-LLM is an LLM inference library that compiles models into runtime engines optimized for NVIDIA GPUs. It raises throughput with in-flight batching, and for auto-regressive generation it manages memory efficiently through features such as paged KV cache and chunked context, sustaining high performance even under heavy load.
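    For orientation, recent TensorRT-LLM releases also expose a high-level Python LLM API that builds an optimized engine from a Hugging Face checkpoint and serves generation with these features active by default. The following is a minimal sketch, not taken from NAVER's setup, assuming a recent tensorrt_llm version and an illustrative model name:

    from tensorrt_llm import LLM, SamplingParams

    # Builds (or loads) a TensorRT engine for the checkpoint and serves it
    # with in-flight batching and paged KV cache enabled by default.
    llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # illustrative model name

    outputs = llm.generate(
        ["Summarize the reviews for this restaurant: ..."],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)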

    Before adopting a library, the team benchmarked candidate LLM inference frameworks on metrics such as throughput, TTFT (Time to First Token), and TPOT (Time Per Output Token) under workloads resembling each service. The chart below compares the throughput of TensorRT-LLM with an alternative open-source LLM inference library for Qwen models across several input and output token lengths on A100 and H100 GPUs.

    Figure 3: QPS of TensorRT-LLM versus an alternative open-source inference library for 7B and 72B Qwen models on A100 and H100, across four workload types: decode-prefill light, prefill heavy, decode heavy, and decode-prefill heavy (image credit: NAVER)

    TensorRT-LLM delivered strong results across all four workload types: decode-prefill light, prefill heavy, decode heavy, and decode-prefill heavy. Our SLM services are mostly decode heavy, and there TensorRT-LLM showed a clear advantage, especially on the NVIDIA Hopper architecture. For guidance on extracting further performance, see the performance overview in the NVIDIA/TensorRT-LLM GitHub and Best practices for tuning the performance of TensorRT-LLM engines.

    Inference optimization: the throughput/latency trade-off

    This section describes how we tuned batch size, paged KV cache, and in-flight batching, the settings that govern memory use, to meet the distinct requirements of each LLM service.

    Batch size

    In LLM serving, raising the batch size improves throughput but also raises per-request latency, a direct trade-off. The batch size therefore has to be chosen while monitoring TTFT and TPOT against each service's requirements.
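    One way to run such a sweep from the client side, sketched here with a hypothetical streaming client (stream_generate, which yields tokens as they arrive) rather than any specific library:

    import time

    def measure_request(stream_generate, prompt):
        # Measures TTFT and TPOT for a single streamed request.
        start = time.perf_counter()
        token_times = []
        for _ in stream_generate(prompt):  # hypothetical streaming client
            token_times.append(time.perf_counter())
        ttft = token_times[0] - start
        # TPOT: mean interval between tokens after the first one.
        tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
        return ttft, tpot

    Repeating the measurement while raising client concurrency (which drives the effective batch size on the server) shows where TTFT and TPOT cross each service's latency budget.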

    Services that need low latency therefore run with a much smaller batch (upper part of the figure below), while services that aim for maximum throughput run at the upper-bound batch size (lower part of the figure).
    Figure 4: Throughput and per-output-token latency (TPOT) as batch size grows (image credit: NAVER)

    Paged KV cache and in-flight batching

    TensorRT-LLM enables paged KV cache by default. By allocating the KV cache in fixed-size blocks, it cuts memory waste and allows larger batches, which benefits throughput-oriented tasks; for latency-critical tasks, however, it is not always the best choice.

    In-flight batching, also enabled by default, continuously admits new requests into the running batch as earlier sequences finish, raising utilization and throughput; it suits tasks that serve many concurrent requests.

    However, when a small model serves light traffic, these features add overhead without improving latency, and it can be better to turn them off. For example, the POI matching service must respond with low latency, but its model is a comparatively small 1.3B model, and to use GPU resources efficiently it runs on low-end T4 GPUs. Fixing the batch size to 1 gave the lowest latency there, so we turned the batching option off.

    In addition, for a small 1.3B model running at batch size 1, the paging overhead outweighed the compute savings, degrading both latency and QPS. Because the batch size is fixed at 1, the memory overhead that paging is designed to mitigate is no longer a concern, so we disabled paged KV cache and used a contiguous KV cache instead.

    precision | paged KV cache | cache blocks | input/output len | max batch size | QPS  | latency (secs)
    fp16      | on             | 7110         | 500/5            | 1              | 6.49 | 0.154
    fp16      | off            | 7110         | 500/5            | 1              | 8.39 | 0.119
    Table 1. Results on a low-spec GPU with a small model and a fixed batch size: disabling paged KV cache improves both QPS and latency.

    Because the POI matching service has both a latency-critical real-time path and a throughput-oriented background batch path, we build two engines with different configurations, shown below: the first, with max_batch_size 1 and paged_kv_cache disabled, serves the real-time path; the second, with max_batch_size 8 and paged_kv_cache enabled, serves the batch path.

    "build_config": {
            "max_input_len": 512,
            "max_output_len": 32,
            "max_batch_size": 1,
            "max_beam_width": 1,
            "max_num_tokens": 4096,
            ...
            "plugin_config": {
                ...
                "paged_kv_cache": false,
                ...
            }
        }
    "build_config": { 
            "max_input_len": 512, 
            "max_output_len": 32, 
            "max_batch_size": 8, 
            "max_beam_width": 1, 
            "max_num_tokens": 4096, 
            ... 
            "plugin_config": { 
                ... 
     
                "paged_kv_cache": true, 
                ... 
            } 
        } 

    Inference optimization: downstream caching

    This section describes the caching strategies we applied in downstream services to avoid recomputing identical work. Both prefix caching and response caching reduce redundant computation and improve response time.

    Prefix caching 

    Prompts in a downstream task often share a common prefix; by reusing the KV cache computed for that prefix, repeated prefill computation can be skipped. TensorRT-LLM supports this through its prefix caching (KV cache reuse) feature: reused blocks are kept in memory and shared across requests. For details, see how to enable KV cache reuse in the TensorRT-LLM GitHub.
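    In TensorRT-LLM's Python LLM API, block reuse is exposed as a KV cache option. The sketch below assumes a recent tensorrt_llm version and an illustrative model; when serving through the Triton TensorRT-LLM backend, the corresponding switch is the enable_kv_cache_reuse model parameter, and the linked document covers the engine build requirements:

    from tensorrt_llm import LLM, SamplingParams
    from tensorrt_llm.llmapi import KvCacheConfig

    # enable_block_reuse lets requests that share a prompt prefix reuse the
    # KV cache blocks computed for that prefix instead of re-running prefill.
    llm = LLM(
        model="Qwen/Qwen2-7B-Instruct",  # illustrative model name
        kv_cache_config=KvCacheConfig(enable_block_reuse=True),
    )

    system_prompt = "You are a POI matching assistant. ..."  # long shared prefix
    for record in ["receipt text 1", "receipt text 2"]:
        out = llm.generate([system_prompt + record], SamplingParams(max_tokens=32))
        print(out[0].outputs[0].text)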

    Prefix caching is especially effective for tasks where TTFT dominates overall latency and where a long shared system prompt makes up most of the input. For example, one of our review-processing pipelines runs roughly 40 multi-step inference calls per item, and reusing the prefix at every step produced a significant speedup.

    Note that when tasks with different system prompts share the same engine, cached blocks are evicted in LRU order, so contention between tasks can lower the cache hit rate; this should be factored into deployment decisions.

    Response caching 

    Using the response caching feature of Triton server, duplicate requests can be handled efficiently.

    In production, the same request often arrives multiple times. For models that do not use multinomial sampling decoding, and therefore always return the same output for the same input, response caching can be applied safely. In the POI matching service currently in production, 4-5% of requests are served from the cache, which reduced overall processing cost by 17%. For details, see the Triton Response Cache documentation.

    Figure 5: Response cache hit metrics for the POI matching service, as shown in Grafana (image credit: NAVER)
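    Response caching is a server-side feature, enabled with Triton's --cache-config server flag and the response_cache field of the model configuration, so clients need no changes. A sketch of the client's view, with hypothetical model and tensor names:

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    def match(text):
        # Hypothetical input tensor name for the POI matching model.
        inp = httpclient.InferInput("text_input", [1, 1], "BYTES")
        inp.set_data_from_numpy(np.array([[text.encode()]], dtype=np.object_))
        result = client.infer("poi_matching", inputs=[inp])
        return result.as_numpy("text_output")

    match("CARD PAYMENT 12,000 KRW ...")  # computed by the model
    match("CARD PAYMENT 12,000 KRW ...")  # identical request: may hit the cache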

    Serving TensorRT-LLM with Triton

    We deploy the SLM models optimized with TensorRT-LLM to production using NVIDIA Triton Inference Server. To handle surrounding logic such as tokenizing, postprocessing, and multi-step orchestration, Triton offers Ensemble models and BLS (Business Logic Scripting). Of the two, we chose BLS for its greater flexibility. The rest of this section describes the problems we ran into while using Triton BLS and how we solved them.

    Explicit management of request/response schemas between models

    Triton models exchange data through pb_tensor objects. A BLS script that orchestrates several models, including the LLM, must therefore convert pb_tensor inputs to NumPy arrays, process them, and convert the results back to pb_tensor at every step.

    This process causes two problems. First, because the input/output schema of each model is not explicitly declared anywhere, schema mismatches surface only as runtime errors that are hard to trace. Second, as the BLS pipeline grows more complex, the same conversion code is repeated at every step, hurting readability and maintainability. The POI matching pipeline below illustrates the scale of the problem.

    Figure 6: The multi-step BLS pipeline of the POI matching service (image credit: NAVER)

    The BLS-based POI matching pipeline is a multi-step process: embedding the OCR output (tokenization, BERT encoder), candidate retrieval (embedding-based retrieval), re-ranking (tokenizer and reranker model), and final answer generation (generator encoder and decoder). Converting between pb_tensor and NumPy formats at every one of these steps made the code verbose and error-prone, dragging down productivity and code quality. The following sections describe how we addressed this.

    IO schema definition and validation

    Rather than the plain Python dataclass pattern that NVIDIA provides in its examples, we adopted Pydantic, which adds data validation on top of type declarations. We defined an explicit request/response schema for every Triton model so that invalid data is rejected at the boundary instead of failing deep inside the pipeline.

    For example, the BlsRequest schema below defines the input contract of the BLS model and validates that at least one field is populated.

    from typing import List, Optional

    from pydantic import Field, root_validator

    # NOTE: Because Triton passes data as pb_tensor and NumPy objects,
    # fields that are not plain Python types must be declared explicitly.
    # We use Pydantic's json_schema_extra to record each field's Triton data type.
    class BlsRequest(TritonFieldModel):
        name: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        subname: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        biznum: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        address: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
        tel: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})

        @root_validator(pre=True)
        def check_all_fields_empty(cls, values):
            # Reject requests in which every field is empty.
            if not any(bool(v) for v in values.values()):
                raise ValueError("All fields cannot be empty")
            return values
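    A quick usage sketch of the schema above, with hypothetical field values:

    # A valid request: at least one field is populated.
    req = BlsRequest(name="Sample Cafe", address=["123 Example-ro, Seoul"])

    # An invalid request: every field is empty, so the root validator raises.
    try:
        BlsRequest()
    except ValueError as err:  # pydantic wraps it in a ValidationError
        print(err)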

    Shared IO type conversion logic

    To eliminate the conversion code repeated in every model, we moved the pb_tensor/Pydantic conversion logic into a base Triton Python model that all models inherit. Each model then only declares its schema and business logic, which keeps development focused and the code consistent.

    The following is an example. This method converts a Pydantic request into Triton pb_tensors, runs inference, and converts the result back into a Pydantic response.

    def _infer_model(self, request, model_name, request_model, response_model_cls, **infer_kwargs):
         # Converts the Pydantic request into Triton pb_tensors.
         pb_tensors = self.convert_pydantic_to_pb_tensors(request_model)
         # Runs model inference through BLS.
         infer_request = pb_utils.InferenceRequest(
             model_name=model_name,
             inputs=pb_tensors,
             requested_output_names=response_model_cls.get_field_names(),
             **infer_kwargs,
         )
         infer_response = infer_request.exec()
         # Converts the Triton response (pb_tensors) back into a Pydantic response.
         return self.convert_pb_tensors_to_pydantic(infer_response, response_model_cls)
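    A minimal sketch of what the shared conversion helpers might look like, assuming string-typed fields carried as BYTES tensors and the standard triton_python_backend_utils API; get_field_names is the schema helper used above:

    import numpy as np
    import triton_python_backend_utils as pb_utils

    class TritonBaseModel:
        def convert_pydantic_to_pb_tensors(self, request_model):
            # One pb_utils.Tensor per populated Pydantic field.
            tensors = []
            for name, value in request_model.dict(exclude_none=True).items():
                values = value if isinstance(value, list) else [value]
                array = np.array([str(v).encode("utf-8") for v in values], dtype=np.object_)
                tensors.append(pb_utils.Tensor(name, array.reshape(1, -1)))
            return tensors

        def convert_pb_tensors_to_pydantic(self, infer_response, response_model_cls):
            # Reads each declared field back out of the response by name.
            fields = {}
            for name in response_model_cls.get_field_names():
                tensor = pb_utils.get_output_tensor_by_name(infer_response, name)
                if tensor is not None:
                    value = [v.decode("utf-8") for v in tensor.as_numpy().flatten()]
                    fields[name] = value if len(value) > 1 else value[0]
            return response_model_cls(**fields)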

    The example below wraps _infer_model for a specific model. A developer only specifies the GeneratorRequest and GeneratorResponse schemas; all conversion is handled by the shared base logic, so the call site stays short and readable.

    def infer_generator(self, request, text_input, max_tokens):
          response_model_cls = schema.GeneratorResponse
          request_model = schema.GeneratorRequest(text_input=text_input, max_tokens=max_tokens)
          return self._infer_model(
              request=request,
              model_name="generator_bls",
              request_model=request_model,
              response_model_cls=response_model_cls,
          )

    Separating BLS business logic and improving testability

    As the business logic inside BLS grew, it became hard to test because it was entangled with the Triton runtime. We solved this by restructuring the code and the test setup as follows.

    • Structuring the code so business logic is separable (see the sketch after this list): 
      • Inference calls and business logic live in separate classes and modules, so each part can be developed and verified independently. 
      • The business logic depends only on a plain Python runtime, not the Triton runtime, so it can be unit tested in any local environment. 
    • Testing the BLS layer:  
      • Tests for the BLS layer itself run in an E2E test environment; by mocking the model inference calls, the orchestration logic of the BLS can be verified without live models. 
    • CI setup: 
      • Unit and integration tests run automatically in CI whenever the code changes. 
      • As a result, regressions are caught early, and quality stays high even as the business logic evolves quickly. 
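    A minimal sketch of that separation, with hypothetical names (match_poi as the pure business logic, FakeInferer standing in for the Triton-backed inference client):

    # Business logic depends only on an inferer interface, not on the Triton
    # runtime, so it can be unit tested in any plain Python environment.
    def match_poi(inferer, ocr_text):
        candidates = inferer.retrieve(ocr_text)
        return inferer.generate(ocr_text, candidates)

    class FakeInferer:
        def retrieve(self, text):
            return ["candidate-poi-1", "candidate-poi-2"]

        def generate(self, text, candidates):
            return candidates[0]

    def test_match_poi_returns_top_candidate():
        assert match_poi(FakeInferer(), "SAMPLE RECEIPT") == "candidate-poi-1"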

    With explicit schema management, shared conversion logic, separated business logic, and automated tests in place, the team now iterates on Triton-based LLM services quickly and reliably.

    Summary

    NAVER Place used NVIDIA TensorRT-LLM to optimize its LLM serving and deployed the optimized models with NVIDIA Triton Inference Server. Thanks to these optimizations, the services meet their performance requirements while saving a significant amount of GPU resources. This made it possible to expand SLM-based vertical services and improve quality across NAVER Place, and the team plans to keep applying the approach as it launches additional vertical services.
