Mastering LLM Techniques: Inference Optimization


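Stacking transformer layers to build large models yields better accuracy, few-shot learning capabilities, and even near-human emergent abilities on a wide range of language tasks. These foundation models are expensive to train, and they can be memory- and compute-intensive during inference (a recurring cost). The most popular large language models (LLMs) today reach tens to hundreds of billions of parameters and, depending on the use case, may need to ingest long inputs (or contexts), which adds further expense.

This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of the transformer architecture and the attention mechanism in general. The next section covers the specifics of LLM inference that the rest of the post builds on.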

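Understanding LLM inference

Most of the popular decoder-only LLMs (GPT-3, for example) are pretrained on the causal modeling objective, essentially as next-word predictors. These LLMs take a series of tokens as input and generate subsequent tokens autoregressively until they meet a stopping criterion (for example, a limit on the number of tokens to generate or a list of stop words) or until they generate a special <end> token that marks the end of generation. This process involves two phases: the prefill phase and the decode phase.

Note that tokens are the atomic parts of language that a model processes. One token is approximately four English characters. All natural-language inputs are tokenized before being passed to the model.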

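Prefill phase or processing the input

In the prefill phase, the LLM processes the input tokens to compute the intermediate states (keys and values), which are used to generate the “first” new token. Each new token depends on all the previous tokens, but because the full extent of the input is known, at a high level this is a highly parallelized matrix-matrix operation. It effectively saturates GPU utilization.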

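Decode phase or generating the output

In the decode phase, the LLM generates output tokens autoregressively, one at a time, until a stopping criterion is met. Each sequential output token needs to know the output states (keys and values) of all previous iterations. This is like a matrix-vector operation that underutilizes the GPU compute capability compared to the prefill phase. The speed at which the data (weights, keys, values, activations) is transferred to the GPU from memory dominates the latency, not how fast the computation actually happens. In other words, this is a memory-bound operation.

Many of the inference challenges and corresponding solutions featured in this post concern the optimization of this decode phase: efficient attention modules, effective management of the keys and values, and others.

Different LLMs may use different tokenizers, so comparing output tokens between them may not be straightforward. When comparing inference throughput, two LLMs with similar tokens-per-second output may still not be equivalent if they use different tokenizers, because corresponding tokens may represent a different number of characters.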

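Batching

The simplest way to improve GPU utilization, and effectively throughput, is batching. Since multiple requests use the same model, the memory cost of the weights is spread out. Larger batches transferred to the GPU and processed all at once make use of more of the available compute.

Batch sizes, however, can only be increased up to a certain limit, at which point they may lead to a memory overflow. Understanding why this happens requires looking at key-value (KV) caching and LLM memory requirements.

Traditional batching (also called static batching) is suboptimal. For each request in a batch, the LLM may generate a different number of completion tokens, so requests have different execution times. As a result, all requests in the batch must wait until the longest request finishes, which is made worse by a large variance in generation lengths. There are methods to mitigate this, such as in-flight batching, which is discussed later.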

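Key-value caching

One common optimization for the decode phase is KV caching. The decode phase generates a single token at each time step, but each token depends on the key and value tensors of all previous tokens (including the KV tensors of the input tokens computed at prefill and any new KV tensors computed up to the current time step).

To avoid recomputing all these tensors for all tokens at each time step, they can be cached in GPU memory. In every iteration, newly computed elements are simply added to the running cache for use in the next iteration. In some implementations, there is one KV cache for each layer of the model.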

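LLM memory requirement

In effect, the two main contributors to the GPU LLM memory requirement are model weights and the KV cache.

Model weights: Memory is occupied by the model parameters. For example, a model with 7 billion parameters (such as Llama 2 7B), loaded in 16-bit precision (FP16 or BF16), takes roughly 7B * sizeof(FP16) ~= 14 GB of memory.
KV caching: Memory is occupied by the cached self-attention tensors, which are kept to avoid redundant computation.

With batching, the KV cache of each request in the batch must still be allocated separately and can have a large memory footprint. The formula below gives the size of the KV cache and applies to most common LLM architectures today.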

Size of KV cache per token in bytes = 2 * (num_layers) * (num_heads * dim_head) * precision_in_bytes

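The first factor of 2 accounts for the K and V matrices. Commonly, the value of (num_heads * dim_head) is the same as the hidden_size (or dimension of the model, d_model) of the transformer. These model attributes are usually found in model cards or the associated config files.

This memory is required for each token in the input sequence, across the batch of inputs. Assuming half-precision, the total size of the KV cache is given by the formula below.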

Total size of KV cache in bytes = (batch_size) * (sequence_length) * 2 * (num_layers) * (hidden_size) * sizeof(FP16)

For example, with a Llama 2 7B model in 16-bit precision and a batch size of 1, the size of the KV cache is 1 * 4096 * 2 * 32 * 4096 * 2 bytes, which is ~2 GB.
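The two formulas above can be checked with a few lines of Python. This is a minimal sketch: the function and parameter names are illustrative choices, and the Llama 2 7B values (32 layers, hidden size 4096, sequence length 4096) are the ones used in the worked example above.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_size, bytes_per_value=2):
    """Total KV cache size in bytes, following the formula in the text."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return batch_size * seq_len * 2 * num_layers * hidden_size * bytes_per_value


def weight_bytes(num_params, bytes_per_value=2):
    """Memory taken by the model weights at the given precision."""
    return num_params * bytes_per_value


# Llama 2 7B, 16-bit precision, batch size 1, sequence length 4096.
kv = kv_cache_bytes(batch_size=1, seq_len=4096, num_layers=32, hidden_size=4096)
weights = weight_bytes(7_000_000_000)

print(f"KV cache: {kv / 2**30:.1f} GiB")
print(f"Weights:  {weights / 1e9:.1f} GB")
```

Because the KV cache term is linear in both batch_size and seq_len, doubling either one doubles the cache, while the weight term stays fixed.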

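Managing this KV cache efficiently is challenging. Because it grows linearly with batch size and sequence length, the memory requirement can scale quickly. Consequently, it limits the throughput that can be served and poses challenges for long-context inputs. This is the motivation behind several of the optimizations featured in this post.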

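Scaling up LLMs with model parallelization

One way to reduce the per-device memory footprint of the model weights is to distribute the model over several GPUs. Spreading the memory and compute footprint makes it possible to run larger models or larger batches of inputs. Model parallelization is a necessity for training or inference on a model that requires more memory than is available on a single device, and for bringing training times and inference measures (latency or throughput) within range for certain use cases. There are several ways of parallelizing the model, depending on how the model weights are split.

Note that data parallelism is also often mentioned in the same context as the techniques listed below. In data parallelism, the weights of the model are copied across multiple devices, and the (global) batch of inputs is sharded across the devices into microbatches. It reduces the overall execution time by processing larger batches. However, it is a training-time optimization that is less relevant during inference.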

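Pipeline parallelism

Pipeline parallelism involves sharding the model (vertically) into chunks, where each chunk comprises a subset of layers that is executed on a separate device. Figure 2a illustrates four-way pipeline parallelism, where the model is sequentially partitioned and a quarter of all layers is executed on each device. The outputs of a group of operations on one device are passed to the next, which continues executing the subsequent chunk. $F_n$ and $B_n$ indicate the forward and backward passes, respectively, on device n. The memory required to store model weights on each device is effectively quartered.

The main limitation of this method is that, because processing is sequential, some devices or layers may remain idle while waiting for the output (activations, gradients) of previous layers. This results in inefficiencies, or “pipeline bubbles,” in both the forward and backward passes. In Figure 2b, the white empty areas are the large pipeline bubbles of naive pipeline parallelism, where devices are idle and underutilized.

Microbatching can mitigate this to some extent, as shown in Figure 2c. The global batch of inputs is split into sub-batches, which are processed one by one, with gradients accumulated at the end. Note that $F_{n,m}$ and $B_{n,m}$ indicate the forward and backward passes, respectively, on device n with microbatch m. This approach shrinks the pipeline bubbles but does not eliminate them completely.

*Figure 2. An illustration of four-way pipeline parallelism. Credit: GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism*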

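Tensor parallelism

Tensor parallelism involves sharding individual layers of the model (horizontally) into smaller, independent blocks of computation that can be executed on different devices. Attention blocks and multi-layer perceptron (MLP) layers are the major components of transformers that can take advantage of tensor parallelism. In multi-head attention blocks, each head or group of heads can be assigned to a different device so that they are computed independently and in parallel.

*Figure 3. Illustration of tensor parallelism in multi-layer perceptron (MLP) and self-attention layers. Credit:* *Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism*

Figure 3a shows an example of two-way tensor parallelism on a two-layer MLP, with each layer represented by a rounded box. Within the first layer, the weight matrix is split into two blocks. The two partial products can then be computed independently on two different devices from the same batch of inputs (the input is passed through an identity operation). This effectively halves the memory needed to store the weights on each device. A reduction operation combines the outputs in the second layer.

Figure 3b is an example of two-way tensor parallelism in the self-attention layer. The multiple attention heads are parallel by nature and can be split across devices.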
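The column-wise split of the first MLP layer can be illustrated with plain Python lists. This is a toy sketch, not a multi-GPU implementation: the matrices, the helper names `matmul` and `split_columns`, and the two "devices" (just two list computations) are assumptions made for illustration.

```python
def matmul(x, w):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column blocks."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]                      # batch of inputs
a = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # first-layer weights

shards = split_columns(a, 2)                 # each shard would live on its own device
partials = [matmul(x, s) for s in shards]    # independent partial products
gathered = [left + right for left, right in zip(*partials)]

print(gathered == matmul(x, a))  # the sharded result matches the unsharded one
```

Each shard holds half of the weight columns, which is where the halved per-device memory comes from; only the small output blocks need to be combined.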

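Sequence parallelism

Tensor parallelism has limitations, as it requires layers to be divided into independent, manageable blocks. It is not applicable to operations like LayerNorm and Dropout, which are instead replicated across the tensor-parallel group. While LayerNorm and Dropout are computationally inexpensive, they do require a considerable amount of memory to store (redundant) activations.

As shown in Reducing Activation Recomputation in Large Transformer Models, these operations are independent across the input sequence, so they can be partitioned along that “sequence dimension,” which makes them more memory efficient. This is called sequence parallelism.

*Figure 4. An illustration of a transformer layer with both tensor and sequence parallelism. Credit: Reducing Activation Recomputation in Large Transformer Models*

Model parallelism techniques are not mutually exclusive and can be used in conjunction. They help scale LLMs and reduce their per-GPU memory footprint, but there are also optimization techniques aimed specifically at the attention module.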

Optimizing the attention mechanism

The scaled dot-product attention (SDPA) operation maps query and key-value pairs to an output, as described in Attention Is All You Need.

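Multi-head attention

As an enhancement to SDPA, executing the attention layers multiple times in parallel with different, learned projections of the Q, K, and V matrices enables the model to jointly attend to information from different representational subspaces at different positions. These subspaces are learned independently, giving the model a richer understanding of the different positions in the input.

As shown in Figure 5, the outputs of the multiple parallel attention operations are concatenated and linearly projected to combine them. Each parallel attention layer is called a “head,” and this approach is called multi-head attention (MHA).

In the original work, each attention head operates on a reduced dimension of the model (such as $d_{model}/8$) when eight parallel attention heads are used. This keeps the computational cost similar to that of single-head attention.

*Figure 5. An illustration of scaled dot-product attention (SDPA, left) and multi-head attention (right), which is simply multiple SDPA heads in parallel. Credit: Attention Is All You Need*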

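Multi-query attention

One of the inference optimizations to MHA, called multi-query attention (MQA) and proposed in Fast Transformer Decoding, shares the keys and values among the multiple attention heads. The query vector is still projected multiple times, as before.

While the amount of computation in MQA is identical to MHA, the amount of data (keys, values) read from memory is a fraction of what it was. When execution is bound by memory bandwidth, this enables better compute utilization. It also reduces the size of the KV cache in memory, leaving room for larger batch sizes.

The reduction in key-value heads comes with a potential drop in accuracy. Additionally, models that need this optimization at inference must be trained (or at least fine-tuned with ~5% of the training volume) with MQA enabled.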

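Grouped-query attention

Grouped-query attention (GQA) strikes a balance between MHA and MQA by projecting keys and values to a few groups of query heads (Figure 6). Within each group, it behaves like multi-query attention.

Figure 6 shows that multi-head attention has multiple key-value heads (left). Grouped-query attention (center) has more than one key-value head but fewer than the number of query heads, which balances memory requirement against model quality. Multi-query attention (right) has a single key-value head to help save memory.

*Figure 6. A comparison of different attention mechanisms. Credit: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints*

Models originally trained with MHA can be “uptrained” with GQA using a fraction of the original training compute. They attain quality close to MHA while keeping computational efficiency closer to MQA. Llama 2 70B is an example of a model that uses GQA.

Optimizations like MQA and GQA reduce the memory required by KV caches by reducing the number of key and value heads that are stored. There may still be inefficiencies in how this KV cache is managed. Separately from optimizing the attention module itself, a later section presents a technique for more efficient KV cache management.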
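To make the memory effect of sharing key-value heads concrete, the per-token KV cache formula from earlier can be written in terms of the number of key-value heads instead of the number of query heads. The sketch below is illustrative only: the configuration (32 layers, 32 query heads, head dimension 128, 16-bit values) and the helper names are assumptions, not the settings of any particular model.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, dim_head, bytes_per_value=2):
    """Per-token KV cache size when only `num_kv_heads` K/V heads are stored."""
    return 2 * num_layers * num_kv_heads * dim_head * bytes_per_value


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical configuration used only for illustration.
layers, query_heads, dim_head = 32, 32, 128

for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_bytes_per_token(layers, kv_heads, dim_head)
    print(f"{name}: {kv_heads:2d} KV heads, {size // 1024} KiB per token")
```

With these assumed numbers, moving from 32 to 8 key-value heads cuts the per-token cache by 4x, and a single shared head cuts it by 32x, while the number of query heads is unchanged.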

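Flash attention

Another way of optimizing the attention mechanism is to modify the ordering of certain computations to take better advantage of the memory hierarchy of GPUs. Neural networks are generally described in terms of layers, and most implementations are laid out that way as well, with one kind of computation performed on the input data at a time, in sequence. This does not always lead to optimal performance, since it can be beneficial to do more calculations on values that have already been brought into the higher, faster levels of the memory hierarchy.

Fusing multiple layers together during the actual computation can minimize the number of times the GPU needs to read from and write to its memory, and can group together calculations that require the same data, even if they belong to different layers of the neural network.

One very popular fusion is FlashAttention, an I/O-aware exact attention algorithm detailed in FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Exact attention means that it is mathematically identical to standard multi-head attention (with variants available for multi-query and grouped-query attention), so it can be swapped into an existing model architecture, or even an already-trained model, with no modifications.

I/O aware means that it takes into account some of the memory movement costs discussed earlier when fusing operations together. In particular, FlashAttention uses “tiling” to fully compute and write out a small part of the final matrix at once, rather than doing part of the computation on the whole matrix in steps and writing out the intermediate values in between.

Figure 7 shows the tiled FlashAttention computation pattern and the memory hierarchy on a 40 GB GPU. The chart on the right shows the relative speedup that comes from fusing and reordering the different components of the attention mechanism.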

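Efficient management of KV cache with paging

Because the size of inputs is unpredictable, KV caches are sometimes statically “over-provisioned” to account for the largest possible input (the supported sequence length). For example, if the maximum sequence length supported by a model is 2,048, a reservation of size 2,048 is made in memory regardless of the size of the input and the generated output of a request. This space may be contiguously allocated, and much of it often remains unused, leading to memory waste or fragmentation. The reserved space is tied up for the lifetime of the request.

*Figure 8. An illustration of memory wastage and fragmentation due to over-provisioning and inefficient KV cache management. Credit: Efficient Memory Management for Large Language Model Serving with PagedAttention*

Inspired by paging in operating systems, the PagedAttention algorithm enables storing continuous keys and values in noncontiguous space in memory. It partitions the KV cache of each request into blocks that each represent a fixed number of tokens and can be stored noncontiguously.

These blocks are fetched as required during attention computation, using a block table that keeps track of them. As new tokens are generated, new blocks are allocated. The block size is fixed, which eliminates the inefficiencies that arise when different requests require differently sized allocations. This significantly limits memory wastage and enables larger batch sizes (and, consequently, higher throughput).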
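The bookkeeping side of this idea can be sketched with a small allocator that hands out fixed-size blocks on demand and records them in a per-request block table. This is a simplified illustration under stated assumptions: the class name `PagedKVAllocator` and its methods are hypothetical, no tensors are stored, and it does not reflect the API of any real serving library.

```python
class PagedKVAllocator:
    """Toy block allocator: tracks which physical blocks each request owns."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> number of tokens cached so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        # Allocate a new block only when the request has none or its last one is full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def release(self, request_id):
        # Finished requests return their blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    alloc.append_token("a")   # 5 tokens need 2 blocks of 4
alloc.append_token("b")       # 1 token needs 1 block
print(alloc.block_tables)
```

In this sketch, a request that has produced 5 tokens holds 8 token slots rather than a full maximum-length reservation, so the waste is bounded by one partially filled block per request.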

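Model optimization techniques

So far, this post has discussed the different ways LLMs consume memory, some of the ways memory can be distributed across several GPUs, and optimizations of the attention mechanism and the KV cache. There are also several model optimization techniques that reduce the memory use on each GPU by modifying the model weights themselves. GPUs also have dedicated hardware for accelerating operations on these modified values, which provides even more speedup.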

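Quantization

Quantization is the process of reducing the precision of a model's weights and activations. Most models are trained with 32 or 16 bits of precision, where each parameter and activation element takes up 32 or 16 bits of memory. However, most deep learning models can be represented effectively with eight or even fewer bits per value.

Figure 9 shows the distribution of values before and after one possible method of quantization. In this case, some precision is lost to rounding and some dynamic range is lost to clipping, which allows the values to be represented in a much smaller format.

Reducing the precision of a model can yield several benefits. If the model takes up less space in memory, larger models fit on the same amount of hardware. Quantization also means that more parameters can be transferred over the same amount of bandwidth, which helps accelerate models that are bandwidth-limited.

There are many different quantization techniques for LLMs that reduce precision on the activations, the weights, or both. Quantizing the weights is much more straightforward because they are fixed after training. However, this can leave some performance on the table because the activations remain at higher precision. GPUs do not have dedicated hardware for multiplying INT8 and FP16 numbers, so the weights must be converted back to a higher precision for the actual operations.

It is also possible to quantize the activations, the inputs of transformer blocks and network layers, but this comes with its own challenges. Activation vectors often contain outliers, which effectively increases their dynamic range and makes these values harder to represent at lower precision than the weights.

One option is to find out where those outliers are likely to show up by passing a representative dataset through the model, and to represent certain activations at a higher precision than others (LLM.int8()). Another option is to borrow the dynamic range of the weights, which are easy to quantize, and reuse that range for the activations.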
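As a concrete, deliberately simple instance of weight quantization, the sketch below applies symmetric absolute-maximum scaling to map floating-point values to 8-bit integers and back. It is one possible scheme chosen for illustration, not the method used by any specific library mentioned in this post; the function names and sample weights are assumptions.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print(max(abs(w - r) for w, r in zip(weights, restored)))  # rounding error
```

Each value now needs one byte instead of two or four, at the cost of a rounding error of at most half a quantization step. A single large outlier would inflate `scale` and coarsen every other value, which is the difficulty with activations described above.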

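Sparsity

As with quantization, many deep learning models have been shown to be robust to pruning, that is, replacing certain values that are close to 0 with 0 itself. Sparse matrices are matrices in which many of the elements are 0. They can be expressed in a condensed form that takes up less space than a full, dense matrix.

*Figure 10. A sparse matrix represented in a compressed format consisting of non-zero data values and their corresponding two-bit indices*

GPUs in particular have hardware acceleration for a certain kind of structured sparsity, in which two out of every four values are zero. Sparse representations can also be combined with quantization to achieve even greater speedups in execution. Finding the best way to represent large language models in a sparse format is still an active area of research and offers a promising direction for future improvements in inference speed.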
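The two-out-of-four pattern and the compressed layout of values plus two-bit indices can be sketched as follows. This is a minimal illustration that prunes by magnitude; the function names and the sample row are assumptions, and real pruning workflows typically also fine-tune the model afterward to recover accuracy.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude values in every group of four.

    Returns the kept values and their positions (0-3) within each group,
    which fit in two bits each.
    """
    assert len(row) % 4 == 0
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        by_magnitude = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        for i in sorted(by_magnitude[:2]):
            values.append(group[i])
            indices.append(i)
    return values, indices


def expand(values, indices):
    """Rebuild the dense row, with zeros in the pruned positions."""
    row = []
    for g in range(0, len(values), 2):
        group = [0.0] * 4
        for v, i in zip(values[g:g + 2], indices[g:g + 2]):
            group[i] = v
        row.extend(group)
    return row


row = [0.1, -0.9, 0.05, 0.4, 0.0, 0.3, -0.2, 0.01]
values, indices = prune_2_of_4(row)
print(values, indices)
print(expand(values, indices))
```

The compressed form stores half as many data values as the dense row, plus two bits of position metadata per kept value.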

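Distillation

Another approach to shrinking a model is to transfer its knowledge to a smaller model through a process called distillation. This process involves training a smaller model (called a student) to mimic the behavior of a larger model (a teacher).

Successful examples of distilled models include DistilBERT, which compresses a BERT model by 40% while retaining 97% of its language understanding capabilities and running 60% faster.

While distillation in LLMs is an active field of research, the general approach was first presented for neural networks in Distilling the Knowledge in a Neural Network:

The student network is trained to mirror the performance of the larger teacher network, using a loss function that measures the discrepancy between their outputs. This objective can be used in addition to the original loss function that matches the student's outputs to the ground-truth labels.
The teacher's outputs that are matched can be those of the very last layer (called logits) or intermediate layer activations.

Figure 11 shows a general framework for knowledge distillation. The logits of the teacher are soft targets that the student optimizes for using a distillation loss. Other distillation methods may use other measures of loss to “distill” knowledge from the teacher.

*Figure 11. A general framework for knowledge distillation. Credit: Knowledge Distillation: A Survey*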
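A common way to write the combined objective is a weighted sum of a soft-target term (the divergence between temperature-softened teacher and student distributions) and the usual cross-entropy with the ground-truth label. The sketch below is one such formulation for a single example; the temperature, the weighting `alpha`, the temperature-squared scaling, and the function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Standard cross-entropy against the ground-truth label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.2]
print(distillation_loss(student, teacher, label=0))
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to incorrect classes.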

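An alternative approach to distillation is to use data synthesized by the teacher for supervised training of a student LLM, which is especially useful when human annotations are scarce or unavailable. Distilling Step by Step! goes one step further by extracting rationales from a teacher LLM in addition to the labels that serve as ground truth. These rationales serve as intermediate reasoning steps for training smaller student LLMs in a data-efficient way.

It is important to note that many state-of-the-art LLMs today have restrictive licenses that prohibit using their outputs to train other LLMs, which makes it challenging to find a suitable teacher model.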

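Model serving techniques

Model execution is frequently memory-bandwidth bound, and in particular bandwidth bound on the weights. Even after applying all the model optimizations described above, it is still very likely to be memory bound. It therefore pays to do as much as possible with the model weights once they are loaded. In other words, try doing things in parallel. Two approaches can be taken:

In-flight batching executes multiple different requests at the same time.
Speculative inference executes multiple different steps of the sequence in parallel to try to save time.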

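In-flight batching

LLMs have some unique execution characteristics that can make it difficult to batch requests effectively in practice. A single model can be used simultaneously for a variety of tasks that look very different from one another. From a simple question-and-answer response in a chatbot to the summarization of a document or the generation of a long chunk of code, workloads are highly dynamic, with outputs varying in size by several orders of magnitude.

This versatility can make it challenging to batch requests and execute them in parallel effectively, which is a common optimization for serving neural networks. Some requests may finish much earlier than others.

To manage these dynamic loads, many LLM serving solutions include an optimized scheduling technique called continuous or in-flight batching. It takes advantage of the fact that the overall text generation process of an LLM can be broken down into multiple iterations of execution on the model.

With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the server runtime immediately evicts finished sequences from the batch. It then begins executing new requests while other requests are still in flight. In-flight batching can therefore greatly increase overall GPU utilization in real-world use cases.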
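The scheduling difference can be seen in a toy simulation that counts decode iterations. This sketch makes several simplifying assumptions: each request is represented only by the number of tokens it will generate (a positive integer), every iteration produces one token per active request, prefill cost is ignored, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))


def inflight_batching_steps(lengths, batch_size):
    """Finished requests are evicted and replaced at every iteration."""
    queue = deque(lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        # Fill any free slots with waiting requests.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # One decode iteration: every active request emits a token.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps


lengths = [2, 8, 3, 1, 4, 2]
print(static_batching_steps(lengths, batch_size=2))
print(inflight_batching_steps(lengths, batch_size=2))
```

In the static case, short requests hold their slot idle until the longest request in the batch completes; in the in-flight case, the freed slot is reused on the very next iteration.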

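Speculative inference

Also known as speculative sampling, assisted generation, or blockwise parallel decoding, speculative inference is a different way of parallelizing the execution of LLMs. Normally, GPT-style large language models are autoregressive models that generate text token by token.

Every generated token relies on all of the tokens that come before it to provide context. This means that in regular execution, it is impossible to generate multiple tokens of the same sequence in parallel: the nth token must be generated before token n+1 can be.

Figure 12 shows an example of speculative inference in which a draft model tentatively predicts multiple future steps that are verified or rejected in parallel. In this case, the first two predicted tokens in the draft are accepted, while the last is rejected and removed before generation continues.

*Figure 12. An example of speculative inference. Credit:* *Blockwise Parallel Decoding for Deep Autoregressive Models*

Speculative sampling offers a workaround. The basic idea is to use some “cheaper” process to generate a draft continuation that is several tokens long. Then, the main “verification” model is executed at multiple steps in parallel, using the cheap draft as “speculative” context for the execution steps where it is needed.

If the verification model generates the same tokens as the draft, those tokens are accepted for the output. Otherwise, everything after the first non-matching token is thrown out, and the process is repeated with a new draft.

There are many different options for generating draft tokens, and each comes with different tradeoffs. You can train multiple models, or fine-tune multiple heads on a single pretrained model, to predict tokens multiple steps into the future. Alternatively, you can use a small model as the draft model and a larger, more capable model as the verifier.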
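The accept-or-reject loop can be sketched with two toy deterministic "models" and greedy decoding. This is an illustration under stated assumptions, not a production algorithm: `target_next` and `draft_next` are hypothetical stand-ins for real models, verification is done position by position here (a real system would score all draft positions in one batched forward pass), and at the first mismatch this variant keeps the verifier's own token before drafting again.

```python
def speculative_step(target_next, draft_next, context, k):
    """Draft k tokens cheaply, then keep the prefix the target model agrees with."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))
    accepted = []
    for token in draft:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # discard the rest of the draft
            return accepted
        accepted.append(token)
    accepted.append(target_next(context + accepted))  # whole draft accepted
    return accepted


def speculative_generate(target_next, draft_next, context, n, k=3):
    out = list(context)
    while len(out) - len(context) < n:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[:len(context) + n]


def greedy_generate(target_next, context, n):
    out = list(context)
    for _ in range(n):
        out.append(target_next(out))
    return out


# Toy stand-ins: the draft agrees with the target except at some positions.
def target_next(ctx):
    return (ctx[-1] * 3 + len(ctx)) % 7


def draft_next(ctx):
    return target_next(ctx) if len(ctx) % 4 else 0


print(speculative_generate(target_next, draft_next, [1], n=12))
print(greedy_generate(target_next, [1], n=12))
```

Under greedy decoding, the speculative output matches what the target model would have produced on its own; the draft only changes how many target steps can be checked together.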

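Conclusion

This post outlines many of the most popular solutions for optimizing and serving LLMs efficiently, whether in the data center or at the edge on a PC. Many of these techniques are optimized and available through NVIDIA TensorRT-LLM, an open-source library consisting of the TensorRT deep learning compiler alongside optimized kernels, preprocessing and postprocessing steps, and multi-GPU/multi-node communication primitives for high performance on NVIDIA GPUs. To learn more, see Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available.

NVIDIA TensorRT-LLM is supported by NVIDIA Triton Inference Server, which enables enterprises to serve multiple AI models concurrently across different AI frameworks, hardware accelerators, and deployment models with high throughput and low latency.

TensorRT-LLM also powers NVIDIA NeMo, an end-to-end, cloud-native enterprise framework for developers to build, customize, and deploy generative AI models with billions of parameters. To get started, see NeMo.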

