NAVER Place builds SLM-based vertical services on top of its Place data, powering features across several place categories.
This post describes how NVIDIA and NAVER optimized the SLM behind these vertical services with TensorRT-LLM and deployed it on Triton Inference Server, and shares the practical know-how gained along the way. For background on the team behind this work, see Introduction to NAVER Place AI Development Team.
SLMs at NAVER Place
An SLM (small language model) has far fewer parameters than a large language model (LLM) and is specialized for a particular domain, which gives it advantages in training cost, inference latency, and serving cost. Because SLMs are cheap to fine-tune, they can be adapted quickly to domain data and can match the quality of a much larger general-purpose model on the target task at a fraction of the serving cost.
NAVER Place fine-tunes SLMs on in-house datasets built from its own services and uses them to improve the quality and coverage of place information across NAVER Place.

POI matching based on an SLM transformer decoder
NAVER Place continuously collects information about businesses from many sources, and each piece of incoming data must be attached to the right place entry. The core problem is matching incoming data to the correct POI (Place Of Interest): given an input record, the system must find the POI it refers to among candidate entries, or decide that it represents a new POI. Rather than relying on a retrieval-only pipeline, we use an SLM transformer decoder that reads the input together with candidate POIs and decides which one, if any, it matches, which improves accuracy on ambiguous cases that pure retrieval handles poorly.
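As a rough illustration of the idea (the prompt format and field names here are hypothetical, not NAVER's actual ones), framing POI matching as a decoder task means serializing the extracted record and the candidate POIs into one prompt and letting the model generate the matching candidate:

```python
# Hypothetical prompt construction for decoder-based POI matching.
# The record and candidate fields are illustrative only.
def build_matching_prompt(record: dict, candidates: list) -> str:
    lines = [
        "Decide which candidate place the extracted record refers to.",
        "Answer with the candidate number, or 'none' if no candidate matches.",
        f"record: {record}",
    ]
    for i, cand in enumerate(candidates):
        lines.append(f"candidate {i}: {cand}")
    lines.append("answer:")
    return "\n".join(lines)

prompt = build_matching_prompt(
    {"name": "Some Cafe", "tel": "02-000-0000"},
    [{"name": "Some Cafe", "tel": "02-000-0000"}, {"name": "Other Cafe"}],
)
```

Because the decoder sees the record and all candidates jointly, it can weigh partial matches (same name, different phone number) in a way a pure embedding-similarity lookup cannot.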

Adopting NVIDIA TensorRT-LLM for inference optimization
TensorRT-LLM is an open-source library that compiles models into runtime engines optimized for NVIDIA GPUs, accelerating LLM inference. It schedules requests efficiently with in-flight batching and applies optimizations suited to auto-regressive generation, such as paged KV cache and chunked context, delivering strong performance across diverse workloads.
When evaluating LLM inference frameworks, we measured throughput, TTFT (time to first token), and TPOT (time per output token) under request patterns resembling our production traffic. The comparison below shows the throughput of TensorRT-LLM against an alternative open-source LLM inference library for a Qwen model across several input/output token lengths on A100 and H100 GPUs.

Across decode-prefill light, prefill heavy, decode heavy, and decode-prefill heavy workloads, TensorRT-LLM performed consistently well. Our SLM workloads are mostly decode heavy, and on these TensorRT-LLM achieved higher throughput on the same GPUs, with the gap widening on the NVIDIA Hopper architecture. For details on the settings behind these results, see the performance overview in the NVIDIA/TensorRT-LLM GitHub repository and Best practices for tuning the performance of TensorRT-LLM engines.
Inference optimization: the throughput-latency trade-off
This section covers how options that govern memory use, such as batch size, paged KV cache, and in-flight batching, affect the throughput and latency of an LLM service, and how we tuned them per task.
Batch size
As in most machine-learning serving, increasing the batch size raises throughput at the cost of higher per-request latency. The batch size therefore has to be tuned to the largest value that still meets the service's TTFT and TPOT requirements.
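For reference, TTFT and TPOT can be measured directly from any streaming API; this minimal sketch uses a simulated token stream in place of a real LLM client:

```python
import time

def generate_stream(n_tokens=5, step=0.001):
    """Stand-in for a streaming LLM API: yields one token per decode step."""
    for _ in range(n_tokens):
        time.sleep(step)
        yield "tok"

def measure(stream):
    start = time.perf_counter()
    ttft, count = None, 0
    for _ in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    # TPOT: average time per output token after the first one.
    tpot = (total - ttft) / max(count - 1, 1)
    return ttft, tpot, count

ttft, tpot, count = measure(generate_stream())
```

TTFT is dominated by prefill (and queueing), TPOT by decode throughput, which is why the two respond differently to batching: larger batches mostly stretch TPOT, while queue wait from full batches shows up in TTFT.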

Paged KV cache and in-flight batching
TensorRT-LLM's paged KV cache manages KV-cache memory in fixed-size blocks, raising memory efficiency and allowing larger effective batch sizes. This benefits throughput-oriented tasks, but the extra indirection can add latency for latency-critical ones.
In-flight batching evicts finished requests from the batch immediately and admits waiting requests in their place, improving both throughput and latency for most serving workloads, which is why it is enabled by default.
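The effect can be illustrated with a toy model (one decode step per time unit, two slots; this is a simplification, not TensorRT-LLM's actual scheduler): with static batching a short request's slot stays occupied until the longest request in its batch finishes, while in-flight batching frees the slot immediately:

```python
import heapq

def static_batch_completion(lengths, batch_size=2):
    # A batch completes only when its longest request does.
    t, done = 0, []
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        t += max(batch)
        done.extend([t] * len(batch))
    return done

def inflight_completion(lengths, slots=2):
    # Each slot frees as soon as its request finishes; the next starts at once.
    free = [0] * slots
    heapq.heapify(free)
    done = []
    for n in lengths:
        start = heapq.heappop(free)
        finish = start + n
        done.append(finish)
        heapq.heappush(free, finish)
    return done

lengths = [5, 50, 5, 50]  # output lengths in decode steps
static_avg = sum(static_batch_completion(lengths)) / len(lengths)    # 75.0
inflight_avg = sum(inflight_completion(lengths)) / len(lengths)      # 31.25
```

Short requests mixed with long ones benefit the most, which matches the auto-regressive setting where output lengths vary widely between requests.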
However, for services where a small model runs on a modest GPU and latency dominates, it can pay to turn these options off. The POI matching service described above is latency sensitive: it serves a 1.3B-parameter model, and given the GPU resources available it runs on T4 GPUs. At that scale the best latency came from fixing the batch size to 1, so we disabled batching.
Furthermore, with a 1.3B model at batch size 1, the paging overhead of the paged KV cache exceeded the compute it saved: since the batch never grows past 1, the memory-efficiency gain is irrelevant, and switching from the paged KV cache to the contiguous KV cache improved both latency and QPS.
| precision | paged KV cache | cache blocks | input/output | max batch size | QPS | latency (secs) |
|---|---|---|---|---|---|---|
| fp16 | on | 7110 | 500/5 | 1 | 6.49 | 0.154 |
| fp16 | off | 7110 | 500/5 | 1 | 8.39 | 0.119 |
For POI matching we therefore built a latency-first engine, while for background batch tasks where throughput matters more than per-request latency we built a separate throughput-first engine; the two build configurations are shown below.
"build_config": {
"max_input_len": 512,
"max_output_len": 32,
"max_batch_size": 1,
"max_beam_width": 1,
"max_num_tokens": 4096,
...
"plugin_config": {
...
"paged_kv_cache": false,
...
}
}
"build_config": {
"max_input_len": 512,
"max_output_len": 32,
"max_batch_size": 8,
"max_beam_width": 1,
"max_num_tokens": 4096,
...
"plugin_config": {
...
"paged_kv_cache": true,
...
}
}
Inference optimization: downstream caching
This section covers how caching can cut repeated computation when serving downstream tasks. Depending on a task's access pattern, prefix caching and response caching can be applied separately or together.
Prefix caching
When the prompts of a downstream task share a common prefix, the prefill computation for that prefix can be cached and reused instead of being recomputed for every request. TensorRT-LLM supports this through its prefix caching (KV cache reuse) feature, which saves both computation and memory. For details, see how to enable KV cache reuse in the TensorRT-LLM GitHub repository.
Prefix caching is especially effective at reducing TTFT, so it pays off most in tasks whose inputs are long or that share a long system prompt. One of our services runs up to 40 multi-step inference calls per request, and reusing the prefix at each step yielded a large speedup.
Keep in mind, however, that the cache is managed with an LRU policy: when many tasks with long system prompts share one cache, entries can be evicted before they are reused, so the cache budget has to be sized with the working set of prefixes in mind.
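The eviction behavior can be sketched with a toy LRU cache (names and block accounting are illustrative, not TensorRT-LLM internals): when two long prefixes from different tasks do not fit in the budget together, they keep evicting each other and no request ever hits:

```python
from collections import OrderedDict

class LruPrefixCache:
    """Toy model of an LRU-managed prefix (KV) cache with a block budget."""
    def __init__(self, budget_blocks):
        self.budget = budget_blocks
        self.entries = OrderedDict()  # prefix -> number of KV blocks

    def lookup_or_insert(self, prefix, blocks):
        if prefix in self.entries:
            self.entries.move_to_end(prefix)  # refresh recency
            return True                       # hit: prefill is reused
        # Evict least recently used entries until the new prefix fits.
        while self.entries and sum(self.entries.values()) + blocks > self.budget:
            self.entries.popitem(last=False)
        if blocks <= self.budget:
            self.entries[prefix] = blocks
        return False

cache = LruPrefixCache(budget_blocks=10)
cache.lookup_or_insert("task_a_system_prompt", 6)        # miss, inserted
cache.lookup_or_insert("task_b_system_prompt", 6)        # miss, evicts task A
hit = cache.lookup_or_insert("task_a_system_prompt", 6)  # miss again: thrashing
```

With a budget of 12 or more blocks, both prefixes would stay resident and the third call would hit, which is the kind of sizing question the prose above refers to.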
Response caching
Triton server's response caching can eliminate repeated computation for identical requests.
In practice the same request does recur, and the full response can then be served from cache. Because we use deterministic decoding rather than multinomial sampling, identical inputs always produce identical outputs, so response caching can be applied without changing results. In the production POI matching service, roughly 4-5% of requests hit the cache, reducing overall processing by about 17%. For details, see the Triton Response Cache documentation.
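The precondition is worth spelling out: response caching only preserves behavior when decoding is deterministic, because a cached response must equal what a fresh call would have produced. A minimal sketch (the decode function is a stand-in, not Triton's cache implementation):

```python
def make_response_cached(decode_fn):
    """Wraps a decode function with a response cache keyed on the request."""
    cache, stats = {}, {"hits": 0, "misses": 0}

    def cached(prompt):
        if prompt in cache:
            stats["hits"] += 1
            return cache[prompt]
        stats["misses"] += 1
        cache[prompt] = decode_fn(prompt)
        return cache[prompt]

    cached.stats = stats
    return cached

# Deterministic stand-in for greedy decoding: same input, same output.
decode = make_response_cached(lambda prompt: prompt.upper())
decode("match this poi")
decode("match this poi")  # second call is served from the cache
```

With multinomial sampling the second call would be expected to differ from the first, so a cache would silently change the service's output distribution.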

Serving TensorRT-LLM through Triton
We serve the TensorRT-LLM-optimized SLM with the NVIDIA Triton Inference Server. Pipelines that combine steps such as tokenization, postprocessing, and multi-step model calls can be expressed either as an Ensemble model or with BLS (Business Logic Scripting); of the two, we chose BLS for the control flow our pipeline needs. The rest of this section shares what we learned about structuring Triton BLS code so that it stays maintainable.
Declarative request/response schema management per model
Triton passes data between models as pb_tensor objects. BLS code that orchestrates several models therefore ends up converting pb_tensor to NumPy arrays to apply logic, then converting the results back to pb_tensor for the next model call.
This causes two problems. First, the input/output names and types of every model are scattered through the code as bare strings, which hurts readability and makes changes error-prone. Second, as models are added to the pipeline, the BLS code accumulates near-duplicate conversion logic and grows steadily harder to maintain. The POI matching pipeline shows how quickly this adds up.

The BLS-based POI matching pipeline is a multi-step process: OCR on the input image, text preprocessing (tokenization, BERT encoder), candidate retrieval (embedding and retrieval), re-ranking (tokenizer and reranker model), and a final matching decision by generation (generator encoder and decoder). Converting between pb_tensor and NumPy at every step, while keeping track of each model's input and output names, made the code complex and error-prone. We introduced the following improvements to address this.
Declarative IO schema definitions
We started from the Python dataclass-based examples NVIDIA provides and took them a step further with Pydantic, which adds data validation. Each Triton model's input/output schema is defined declaratively as a Pydantic model, in one place.
For example, the BlsRequest model below declares the input schema of the BLS entry point, including each field's Triton data type.
```python
from typing import List, Optional
from pydantic import Field, root_validator

# NOTE: Because Triton uses pb_tensor and NumPy objects, fields that are not
# Python default types must be managed declaratively. We use the
# json_schema_extra field of Pydantic to record each field's Triton data type.
# TritonFieldModel is our Pydantic base class for Triton schemas.
class BlsRequest(TritonFieldModel):
    name: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    subname: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    biznum: Optional[str] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    address: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})
    tel: Optional[List[str]] = Field(None, json_schema_extra={'data_type': "TYPE_STRING"})

    @root_validator(pre=True)
    def check_all_fields_empty(cls, values):
        if not any(bool(v) for v in values.values()):
            raise ValueError("All fields cannot be empty")
        return values
```
Validation and IO type conversion automation
With the schemas defined, we implemented a base Triton Python model that handles validation and the conversion between pb_tensor and Pydantic models in one place. Individual models inherit from it, so the repetitive conversion code disappears and conversion mistakes are caught early.
The core looks like this: the method converts a Pydantic request model into Triton pb_tensors, executes inference, and converts the resulting pb_tensors back into a Pydantic response model.
```python
import triton_python_backend_utils as pb_utils

def _infer_model(self, request, request_model, response_model_cls, model_name, **infer_kwargs):
    # Converts the Pydantic request model to Triton pb_tensors.
    pb_tensors = self.convert_pydantic_to_pb_tensors(request_model)
    # Runs model inference.
    infer_request = pb_utils.InferenceRequest(
        model_name=model_name,
        inputs=pb_tensors,
        requested_output_names=response_model_cls.get_field_names(),
        **infer_kwargs,
    )
    infer_response = infer_request.exec()
    # Converts the Triton response (pb_tensors) back to a Pydantic response model.
    return self.convert_pb_tensors_to_pydantic(infer_response, response_model_cls)
```
Calling _infer_model then becomes a one-liner per model. The example below builds a GeneratorRequest, invokes the generator model, and returns a GeneratorResponse, so type conversion and validation happen in one consistent place.
```python
def infer_generator(self, request, text_input, max_tokens):
    response_model_cls = schema.GeneratorResponse
    request_model = schema.GeneratorRequest(text_input=text_input, max_tokens=max_tokens)
    return self._infer_model(
        request=request,
        model_name="generator_bls",
        request_model=request_model,
        response_model_cls=response_model_cls,
    )
```
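Outside a Triton runtime, the conversion idea can be sketched self-contained (a pb_tensor is stood in by a plain tuple here, and the schema and field names are illustrative, not the real base-class code):

```python
# Self-contained sketch of schema-driven IO conversion. A Triton pb_tensor is
# stood in by a (name, triton_type, data) tuple; the schema maps field names
# to Triton type strings, mirroring the json_schema_extra declarations above.
SCHEMA = {"text_input": "TYPE_STRING", "max_tokens": "TYPE_INT32"}

def payload_to_tensors(payload):
    tensors = []
    for field, triton_type in SCHEMA.items():
        value = payload[field]
        if triton_type == "TYPE_STRING":
            tensors.append((field, triton_type, [str(value).encode()]))
        else:
            tensors.append((field, triton_type, [int(value)]))
    return tensors

def tensors_to_payload(tensors):
    payload = {}
    for name, _triton_type, data in tensors:
        value = data[0]
        payload[name] = value.decode() if isinstance(value, bytes) else value
    return payload

roundtrip = tensors_to_payload(payload_to_tensors({"text_input": "hi", "max_tokens": 32}))
```

Because the schema is the single source of truth, adding a field means editing one mapping rather than hunting down every conversion site, which is the maintainability win the base class provides.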
Separating BLS business logic and improving testability
BLS code mixes business logic with Triton-specific model calls, which makes it hard to test. To keep the code maintainable, we separated the two concerns and built tests around that boundary, following these principles:
- Separate business logic from model invocation:
  - Input/output conversion and model-call code is isolated from the business logic, so the logic itself stays free of Triton plumbing.
  - Code that depends on the Triton runtime is kept apart from plain Python code, so the business logic can be unit tested without a running Triton server.
- Unit tests for BLS:
  - BLS code is otherwise verifiable only through end-to-end tests. By mocking the model-invocation layer, we can exercise the BLS logic itself with fast unit tests.
- CI integration:
  - The unit tests run in CI on every code change.
  - As a result, regressions in the pipeline logic are caught before deployment rather than in production.
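As a sketch of the mocking approach (class and method names here are hypothetical, not NAVER's actual code), business logic that receives its model-call dependency can be exercised with unittest.mock and no Triton server:

```python
from unittest import mock

class PoiMatcher:
    """Business logic only: the Triton model calls live behind `client`."""
    def __init__(self, client):
        self.client = client

    def match(self, record, candidates):
        # Pick the candidate the model scores highest.
        scores = [self.client.score(record, cand) for cand in candidates]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best]

# Unit test: mock the model-invocation layer instead of running Triton.
client = mock.Mock()
client.score.side_effect = [0.1, 0.9, 0.3]
matcher = PoiMatcher(client)
result = matcher.match("receipt text", ["POI-A", "POI-B", "POI-C"])
```

Because the dependency is injected, the same class runs unchanged in production with a real client that wraps the BLS inference calls.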
Together, declarative schema management, automated type conversion, and separated, tested business logic made developing LLM pipelines on Triton significantly more productive.
Conclusion
NAVER Place optimized its SLMs with NVIDIA TensorRT-LLM and deployed them on the NVIDIA Triton Inference Server. The optimizations described here let the services meet their latency targets while using significantly fewer GPUs. Building on these results, NAVER Place plans to extend SLM-based vertical services to more domains. We hope this post is useful to other teams building vertical services on large language models.