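LLM Model Pruning and Knowledge Distillation with NVIDIA NeMo Framework

Model pruning and knowledge distillation are powerful, cost-effective strategies for obtaining smaller language models from an initial larger sibling.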

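Pruning: Dropping layers (depth pruning) or dropping neurons, attention heads, and embedding channels (width pruning).
Knowledge distillation: Transferring knowledge from a large teacher model to a smaller student model, with the goal of creating a smaller, more efficient model that runs faster and uses fewer resources.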
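To make the two pruning styles concrete, here is a minimal toy sketch. It is not how NeMo prunes a model: the "model" is just a nested Python list, and the importance score (squared L2 norm of each neuron's weights) is an assumption chosen for simplicity.

```python
# Toy illustration only: a "model" is a list of layers, and each layer is a
# list of neurons (rows of weights). Real LLM pruning operates on tensors.

def depth_prune(layers, drop_indices):
    """Depth pruning: remove whole layers by index."""
    drop = set(drop_indices)
    return [layer for i, layer in enumerate(layers) if i not in drop]


def width_prune(layer, keep):
    """Width pruning: keep the `keep` highest-scoring neurons of one layer.

    Assumption: the squared L2 norm of a neuron's weights stands in for a
    real importance estimate.
    """
    scores = [sum(w * w for w in neuron) for neuron in layer]
    ranked = sorted(range(len(layer)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    return [layer[i] for i in kept]


# Four layers, each with three neurons of two weights.
model = [[[0.1, 0.2], [1.0, -1.0], [0.0, 0.05]] for _ in range(4)]

shallower = depth_prune(model, drop_indices=[2, 3])
narrower = [width_prune(layer, keep=2) for layer in model]

print(len(shallower), len(shallower[0]))  # layers, neurons per layer
print(len(narrower), len(narrower[0]))
```

Depth pruning leaves fewer layers of the original width, while width pruning keeps every layer but makes each one narrower.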

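The post "How to Prune and Distill Llama-3.1 8B" discussed best practices for large language models (LLMs) that combine depth, width, attention, and MLP pruning with knowledge distillation-based retraining.

In this post, we provide a walk-through tutorial of the pruning and distillation pipeline in the NVIDIA NeMo framework on a simple dataset. The tutorial uses Meta-Llama-3.1-8B as the teacher model, with a target model size of 4B. We also visualize and discuss the training results.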

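Overview

This tutorial focuses on creating a simple pipeline that prepares the dataset, fine-tunes the teacher on the WikiText-103-v1 dataset, and then prunes and distills the model to create the 4B model. The WikiText-103-v1 dataset contains over 100M tokens extracted from a set of verified "Good" and "Featured" articles on Wikipedia. It is publicly available on Hugging Face.

In this tutorial, you define a pruning and distillation pipeline that involves the following high-level steps (Figure 1).

A workflow diagram shows downloading the dataset, tokenizing, fine-tuning the 8B teacher model, pruning the teacher model, and distilling knowledge from teacher to student. *Figure 1. Steps from obtaining the dataset to creating the distilled 4B model*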

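- Preparation:
  - Download the dataset and convert it to JSONL.
  - Preprocess the dataset by tokenizing it.
- Fine-tune the teacher model on the dataset.
- Depth-prune the fine-tuned teacher model. The depth-pruned model is a starting point for the student network.
- Width-prune the fine-tuned teacher model. The width-pruned model is a starting point for the student network.
- Distill knowledge from the teacher to the student, using the 8B model as the teacher and the 4B pruned model as the student.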

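To access the Jupyter notebooks for this tutorial, see the /NVIDIA/NeMo GitHub repository.

Prerequisites

You need access to at least eight NVIDIA GPUs with 80 GB of memory each, for example, eight H100-80GB or A100-80GB GPUs, and a Docker-enabled environment.

Follow the instructions in the project's README file to install the NeMo framework, download the Meta-Llama-3.1-8B Instruct model, and get your Hugging Face access token.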

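Download the dataset

Download the WikiText-103-v1 dataset and convert the train, test, and validation splits into JSONL files, either with the following code or by running the introduction notebook: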

# Split into train, test and val files
 
import json
import os
from datasets import load_dataset
 
# Load the WikiText-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1")
 
# Define the destination folder
data_folder = 'wikitext-data'
os.makedirs(data_folder, exist_ok=True)
 
# Define file paths and destination paths
file_paths = {
    'train': os.path.join(data_folder, 'wikitext-train.jsonl'),
    'validation': os.path.join(data_folder, 'wikitext-val.jsonl'),
    'test': os.path.join(data_folder, 'wikitext-test.jsonl')
}
 
# Function to save dataset split to a JSONL file
def save_to_jsonl(file_path, data):
    with open(file_path, 'w') as file:
        for item in data:
            file.write(json.dumps(item) + '\n')
 
# Define splits
splits = ["train", "validation", "test"]
 
# Save each available split to its JSONL file
for split in splits:
    if split in dataset:
        save_to_jsonl(file_paths[split], dataset[split])
    else:
        print(f"Split {split} not found in the dataset.")
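After the conversion, a quick sanity check of the JSONL files can catch an empty or malformed export before tokenization. The sketch below assumes each record stores its content under a `text` key, as the WikiText splits on Hugging Face do; the helper name `summarize_jsonl` is ours, not part of any library.

```python
import json
import os


def summarize_jsonl(path, text_key="text"):
    """Return (record_count, non_empty_count) for a JSONL file."""
    total = 0
    non_empty = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            total += 1
            # Assumption: the document text lives under `text_key`.
            if record.get(text_key, "").strip():
                non_empty += 1
    return total, non_empty


data_folder = "wikitext-data"
for name in ("wikitext-train.jsonl", "wikitext-val.jsonl", "wikitext-test.jsonl"):
    path = os.path.join(data_folder, name)
    if os.path.exists(path):
        total, non_empty = summarize_jsonl(path)
        print(f"{name}: {total} records, {non_empty} with non-empty text")
    else:
        print(f"{name}: not found, run the conversion step first")
```

The dataset includes blank separator lines, so the non-empty count can be noticeably lower than the record count.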

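Prepare the dataset

The pruning and distillation scripts require the data files to be preprocessed: tokenizing them with the meta-llama/Meta-Llama-3.1-8B tokenizer model converts the data into a memory-mapped format. This can be done with the preprocessing script, preprocess_data_for_megatron.py, in the NeMo framework.

Run the following script on the train split to prepare the dataset for pruning and distillation: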

!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input="./wikitext-data/wikitext-train.jsonl" \
--tokenizer-library='huggingface' \
--tokenizer-type='meta-llama/Meta-Llama-3.1-8B' \
--output-prefix=wikitext_tokenized_train \
--append-eod \
--workers=32

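Run the script on the test and validation splits as well. The data preparation notebook contains all the scripts needed to create the tokenized wikitext_tokenized_{train/val/test}_text_document.{idx/bin} files that are used to fine-tune the teacher model.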
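One way to avoid hand-editing the command for each split is to generate it. The sketch below only builds and prints the three commands using the flags shown in the train example; the helper `build_preprocess_command` is a hypothetical name, and the `val` and `test` file names follow the JSONL files created earlier.

```python
import shlex

SCRIPT = "/opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py"


def build_preprocess_command(split, workers=32):
    """Build the preprocessing command for one split (train, val, or test)."""
    return [
        "python", SCRIPT,
        f"--input=./wikitext-data/wikitext-{split}.jsonl",
        "--tokenizer-library=huggingface",
        "--tokenizer-type=meta-llama/Meta-Llama-3.1-8B",
        f"--output-prefix=wikitext_tokenized_{split}",
        "--append-eod",
        f"--workers={workers}",
    ]


for split in ("train", "val", "test"):
    # Print only; nothing is executed here.
    print(shlex.join(build_preprocess_command(split)))
```

To actually run a command, the returned list can be passed to `subprocess.run(cmd, check=True)` inside the NeMo container.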

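Fine-tune the teacher on the dataset

With the prepared dataset, perform fine-tuning on the unpruned teacher model. This section demonstrates how to use the scripts rather than how to maximize performance, so the fine-tuning setup runs with GLOBAL_BATCH_SIZE set to 128 and STEPS set to 30 to keep training time short.

A workflow diagram shows multiple steps: input token, embedding, transformer layers, LM head, Softmax, Logits, cross-entropy loss, and next token. Steps are marked as trainable or loss. *Figure 2. Teacher fine-tuning*

Run the megatron_gpt_pretraining.py script to correct for the distribution shift relative to the original dataset on which the model was trained. Without this correction, the teacher provides suboptimal guidance on the dataset during distillation.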

%%bash 
 
export CUDA_DEVICE_MAX_CONNECTIONS=1
 
# Set path(s) if different:
 
MODEL="/workspace/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo"
 
# Can change these to accommodate resources:
 
TENSOR_PARALLEL_SIZE=8
NODES=1
MICRO_BATCH_SIZE=4
 
# Don't change the following:
 
EXPERIMENT_DIR="distill_trainings"
EXPERIMENT_NAME="megatron_llama_ft"
 
DATA_TRAIN='wikitext_tokenized_train_text_document'
DATA_VAL='wikitext_tokenized_test_text_document'
DATA_TEST='wikitext_tokenized_val_text_document'
 
STEPS=30
GLOBAL_BATCH_SIZE=128
 
LOG_INTERVAL=1
VAL_INTERVAL=10
NUM_VAL_BATCHES=5
 
LR=1e-4
MIN_LR=1e-5
WARMUP_STEPS=2
 
cmd="torchrun --nproc-per-node=${TENSOR_PARALLEL_SIZE}"
 
${cmd} /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    --config-path /opt/NeMo/examples/nlp/language_modeling/conf/ \
    --config-name megatron_llama_distill.yaml \
    \
    name=${EXPERIMENT_NAME} \
    \
    exp_manager.exp_dir=${EXPERIMENT_DIR} \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    \
    trainer.max_steps=${STEPS} \
    trainer.log_every_n_steps=${LOG_INTERVAL} \

Running the script or executing the teacher fine-tuning notebook creates a fine-tuned teacher model.
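The batch-size variables in the script interact with the parallelism settings. As a rough sketch, assuming the usual Megatron-style relation (global batch = micro batch × data-parallel size × gradient-accumulation steps), the values above work out as follows. The function names are illustrative, not NeMo APIs.

```python
def data_parallel_size(num_gpus, tensor_parallel, pipeline_parallel=1):
    """GPUs left for data parallelism after model parallelism is applied."""
    model_parallel = tensor_parallel * pipeline_parallel
    if num_gpus % model_parallel:
        raise ValueError("num_gpus must be divisible by TP * PP")
    return num_gpus // model_parallel


def grad_accumulation_steps(global_batch, micro_batch, dp_size):
    """Micro-batches accumulated per optimizer step on each replica."""
    per_pass = micro_batch * dp_size
    if global_batch % per_pass:
        raise ValueError("global_batch must be divisible by micro_batch * dp_size")
    return global_batch // per_pass


# Values from the fine-tuning script: 8 GPUs, TENSOR_PARALLEL_SIZE=8,
# MICRO_BATCH_SIZE=4, GLOBAL_BATCH_SIZE=128.
dp = data_parallel_size(num_gpus=8, tensor_parallel=8)
accum = grad_accumulation_steps(global_batch=128, micro_batch=4, dp_size=dp)
print(f"data-parallel size: {dp}, gradient-accumulation steps: {accum}")
```

Under this assumption, a tensor-parallel size of 8 on one 8-GPU node leaves a single data-parallel replica, so each optimizer step accumulates 32 micro-batches. Lowering the tensor-parallel size (if memory allows) would trade accumulation steps for data parallelism.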

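Prune the fine-tuned teacher model to create a student

You can use two methods to prune the fine-tuned teacher model: depth-pruning and width-pruning.

According to the technical report, width-pruning generally outperforms depth-pruning in accuracy, but at the cost of increased inference latency. Based on these considerations, choose depth-pruning, width-pruning, or both.

A diagram shows the iterative steps of training the LLM, estimating importance, ranking, trimming, and distilling. *Figure 3. Pruning the fine-tuned teacher model*

Depth-prune the fine-tuned teacher model to create a student

In the first method, you depth-prune the model. To go from an 8B to a 4B model, prune the last 16 layers (layers 16 to 31). Run the megatron_gpt_drop_layers.py script to depth-prune the fine-tuned teacher model: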

!python -m torch.distributed.launch --nproc_per_node=8 \
     /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_drop_layers.py \
     --path_to_nemo "./distill_trainings/megatron_llama_ft/checkpoints/megatron_llama_ft.nemo" \
     --path_to_save "/workspace/4b_depth_pruned_model.nemo" \
     --tensor_model_parallel_size 8 \
     --pipeline_model_parallel_size 1 \
     --gpus_per_node 8 \
     --drop_layers 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

Running this script or executing the depth-pruning notebook creates a smaller checkpoint with the last 16 layers removed: 4b_depth_pruned_model.nemo.
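The long list of indices passed to `--drop_layers` is easy to mistype. A small sketch for generating it is shown below, assuming the teacher has 32 transformer layers indexed 0 to 31 (implied by "layers 16 to 31" being the last 16); `last_n_layers` is a hypothetical helper.

```python
# Assumption: the 8B teacher has 32 transformer layers, indexed 0-31.
NUM_TEACHER_LAYERS = 32


def last_n_layers(num_layers, n):
    """Indices of the final `n` layers of a `num_layers`-deep model."""
    if not 0 < n < num_layers:
        raise ValueError("n must be between 1 and num_layers - 1")
    return list(range(num_layers - n, num_layers))


drop = last_n_layers(NUM_TEACHER_LAYERS, 16)
dropped = set(drop)
kept = [i for i in range(NUM_TEACHER_LAYERS) if i not in dropped]

drop_arg = "--drop_layers " + " ".join(str(i) for i in drop)
print(drop_arg)
print(f"layers kept: {kept[0]}-{kept[-1]} ({len(kept)} total)")
```

The same helper can be used to experiment with other depths, for example dropping only the last 8 layers, though the resulting model size then differs from the 4B target.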

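Width-prune the fine-tuned teacher model to create a student

In the second method, you width-prune the model. To go from an 8B to a 4B model, prune the model by reducing the MLP intermediate dimension and the hidden size while retaining the number of attention heads and the number of layers.

Run the megatron_gpt_prune.py script to width-prune the fine-tuned teacher model: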

!torchrun --nproc-per-node=8 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_prune.py \
     model.restore_from_path="./distill_trainings/megatron_llama_ft/checkpoints/megatron_llama_ft.nemo" \
     model.tensor_model_parallel_size=1 \
     model.pipeline_model_parallel_size=8 \
     +model.dist_ckpt_load_strictness=log_all \
     inference.batch_size=64 \
     trainer.num_nodes=1 \
     trainer.precision=bf16 \
     trainer.devices=8 \
     prune.ffn_hidden_size=9216 \
     prune.num_attention_heads=null \
     prune.num_query_groups=null \
     prune.hidden_size=3072 \
     export.save_path="/workspace/4b_width_pruned_model.nemo"

Running this script or executing the width-pruning notebook creates a smaller width-pruned checkpoint: 4b_width_pruned_model.nemo.
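To see why both pruning recipes land near 4B parameters, the following back-of-the-envelope sketch counts the weights of a Llama-style decoder. The teacher dimensions (hidden size 4096, MLP size 14336, 32 layers, 32 attention heads of dimension 128, 8 query groups, vocabulary of 128256, untied embeddings) are assumptions taken from the publicly documented Llama-3.1-8B configuration, not from this tutorial. The pruned sizes (hidden size 3072, MLP size 9216, 16 layers dropped) come from the commands above.

```python
def llama_param_count(hidden, ffn, layers, heads=32, kv_groups=8,
                      head_dim=128, vocab=128256):
    """Approximate parameter count of a Llama-style decoder (no biases)."""
    attn = (
        hidden * heads * head_dim             # query projection
        + 2 * hidden * kv_groups * head_dim   # key and value projections (GQA)
        + heads * head_dim * hidden           # output projection
    )
    mlp = 3 * hidden * ffn                    # gate, up, and down projections
    norms = 2 * hidden                        # two norm weights per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * vocab * hidden           # input embedding + untied LM head
    return layers * per_layer + embeddings + hidden  # + final norm weight


teacher = llama_param_count(hidden=4096, ffn=14336, layers=32)
width_pruned = llama_param_count(hidden=3072, ffn=9216, layers=32)
depth_pruned = llama_param_count(hidden=4096, ffn=14336, layers=16)

for name, count in (("teacher", teacher),
                    ("width-pruned", width_pruned),
                    ("depth-pruned", depth_pruned)):
    print(f"{name}: {count / 1e9:.2f}B parameters")
```

Under these assumptions the teacher comes to about 8.0B parameters and both students to about 4.5B, with the embedding tables accounting for a larger share of the student models than of the teacher.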

Distill knowledge from the teacher into the student model

The distillation process uses the fine-tuned 8B model as the teacher and the pruned 4B model as the student. Currently, only the logit loss function is available in NeMo.

A workflow diagram shows classical knowledge distillation from teacher to student, with loss function from several layers of the transformer architecture. A student model with N layers is distilled from a teacher model with M layers. The student learns by minimizing a combination of embedding output loss, logit loss and transformer encoder specific losses mapped across student block S and teacher block T. *Figure 4. Distillation workflow*
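As an illustration of what a logit-based distillation loss measures, here is a minimal pure-Python sketch of the forward KL divergence between the teacher's and the student's next-token distributions at a single position. The exact formulation used by NeMo (direction of the divergence, temperature, reduction over tokens) is not stated here, so treat this as a generic example rather than the framework's implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def logit_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over the vocabulary for one token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


teacher_logits = [2.0, 1.0, 0.1]       # toy 3-token vocabulary
matching_student = [2.0, 1.0, 0.1]
untrained_student = [0.5, 0.5, 0.5]

print(logit_distillation_loss(teacher_logits, matching_student))
print(logit_distillation_loss(teacher_logits, untrained_student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge, which is the signal the student is trained to minimize.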

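In this section, you distill knowledge from the teacher model into each of the two student models and compare them:

- Distill knowledge from the fine-tuned teacher to the depth-pruned student
- Distill knowledge from the fine-tuned teacher to the width-pruned student

Distill knowledge from the fine-tuned teacher into the depth-pruned student

Run the megatron_gpt_distillation.py script to distill knowledge from the teacher to the depth-pruned student model.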

%%bash 
 
export CUDA_DEVICE_MAX_CONNECTIONS=1
 
# Can change these to accommodate resources:
 
TENSOR_PARALLEL_SIZE=8
NODES=1
MICRO_BATCH_SIZE=4
 
# Don't change the following:
 
EXPERIMENT_DIR="distill_trainings"
EXPERIMENT_NAME="megatron_llama_distill_depth_pruned_student"
 
TEACHER="${EXPERIMENT_DIR}/megatron_llama_ft/checkpoints/megatron_llama_ft.nemo"
STUDENT="/workspace/4b_depth_pruned_model.nemo"
 
FINAL_MODEL_PATH="${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/checkpoints/depth_pruned_distilled_4b_model.nemo"
 
DATA_TRAIN='wikitext_tokenized_train_text_document'
DATA_VAL='wikitext_tokenized_test_text_document'
DATA_TEST='wikitext_tokenized_val_text_document'
 
STEPS=30
GLOBAL_BATCH_SIZE=128
 
LOG_INTERVAL=1
VAL_INTERVAL=10
NUM_VAL_BATCHES=5
 
LR=1e-4
MIN_LR=1e-5
WARMUP_STEPS=2
 
cmd="torchrun --nproc-per-node=${TENSOR_PARALLEL_SIZE}"
 
${cmd} /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_distillation.py \
    name=${EXPERIMENT_NAME} \
    \
    exp_manager.exp_dir=${EXPERIMENT_DIR} \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    \
    trainer.max_steps=${STEPS} \
    trainer.log_every_n_steps=${LOG_INTERVAL} \
    trainer.val_check_interval=${VAL_INTERVAL} \
    trainer.limit_val_batches=${NUM_VAL_BATCHES} \
    +trainer.num_sanity_val_steps=0 \
    \
    trainer.precision=bf16 \
    trainer.devices=${TENSOR_PARALLEL_SIZE} \
    trainer.num_nodes=${NODES} \
    \
    "model.data.data_prefix={train:[1.0,$DATA_TRAIN],validation:[$DATA_VAL],test:[$DATA_TEST]}" \
    \
    model.restore_from_path=${STUDENT} \
    model.kd_teacher_restore_from_path=${TEACHER} \
    model.nemo_path=${FINAL_MODEL_PATH} \
    \
    model.tensor_model_parallel_size=${TENSOR_PARALLEL_SIZE} \
    model.sequence_parallel=True \
    model.micro_batch_size=${MICRO_BATCH_SIZE} \
    model.global_batch_size=${GLOBAL_BATCH_SIZE} \
    \
    model.optim.name=distributed_fused_adam \
    model.optim.lr=${LR} \
    model.optim.sched.min_lr=${MIN_LR} \
    model.optim.sched.warmup_steps=${WARMUP_STEPS}

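Running this script or the depth-pruned student distillation notebook creates a distilled model: depth_pruned_distilled_4b_model.nemo.

Distill knowledge from the fine-tuned teacher into the width-pruned student

Run the megatron_gpt_distillation.py script to distill knowledge from the teacher to the width-pruned student model. Before running the script, change the student model (STUDENT) and the path where the distilled model is saved (FINAL_MODEL_PATH).

Running the width-pruned student distillation notebook creates a distilled model: width_pruned_distilled_4b_model.nemo.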
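The variables that differ between the depth-pruned and width-pruned runs can be summarized as follows. This is a sketch: the student checkpoint path and the output file name come from the steps above, while the experiment name `megatron_llama_distill_width_pruned_student` is a hypothetical choice that mirrors the depth-pruned run.

```python
EXPERIMENT_DIR = "distill_trainings"
# Hypothetical experiment name, chosen to mirror the depth-pruned run.
EXPERIMENT_NAME = "megatron_llama_distill_width_pruned_student"


def width_pruned_overrides(experiment_dir=EXPERIMENT_DIR,
                           experiment_name=EXPERIMENT_NAME):
    """Shell variable values to change for the width-pruned student run."""
    return {
        "EXPERIMENT_NAME": experiment_name,
        "STUDENT": "/workspace/4b_width_pruned_model.nemo",
        "FINAL_MODEL_PATH": (
            f"{experiment_dir}/{experiment_name}/checkpoints/"
            "width_pruned_distilled_4b_model.nemo"
        ),
    }


for key, value in width_pruned_overrides().items():
    print(f'{key}="{value}"')
```

The printed lines can replace the corresponding assignments in the distillation script; all other settings stay the same.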

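Display the validation loss

Run the following commands or execute the results notebook to visualize the validation loss. Modify the checkpoint path before running the code example: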

%load_ext tensorboard
%tensorboard --logdir "distill_trainings/megatron_llama_distill/" --port=6007

When you run the distillation scripts with a STEPS value of 30, you can see the validation loss in Figure 5 for the depth-pruned student and in Figure 6 for the width-pruned student.

A plot shows the validation loss under 8 after running the training step in the distillation script for 30 steps with the depth-pruned student. *Figure 5. Depth-pruned validation loss over 30 steps*

A plot shows the validation loss under 8 after running the training step in the distillation script for 30 steps with the width-pruned student. *Figure 6. Width-pruned validation loss over 30 steps*

To configure this pipeline for your use case, run the scripts on a multi-node cluster with larger GLOBAL_BATCH_SIZE, STEPS, and VAL_INTERVAL values to improve the validation loss.

Figure 7 and Figure 8 show the validation loss decreasing when you run the training step in the distillation script with a STEPS value of 880 and a GLOBAL_BATCH_SIZE value of 2048, using the depth-pruned and width-pruned students, respectively.

A plot shows the validation loss under 2.5 after running the training step in the distillation script with the depth-pruned model as the student. *Figure 7. Depth-pruned validation loss over 880 steps (with GBS=2048)*

A plot shows the validation loss drop to under 2.5 after running the training step in the distillation script with the width-pruned model as the student. *Figure 8. Width-pruned validation loss over 880 steps (with GBS=2048)*
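The gap between the two sets of curves is easier to interpret when the training budgets are compared directly. The short sketch below counts the sequences each configuration processes; it says nothing about tokens, because the sequence length is not given here.

```python
def sequences_seen(steps, global_batch_size):
    """Total training sequences processed over a run."""
    return steps * global_batch_size


demo_run = sequences_seen(steps=30, global_batch_size=128)
longer_run = sequences_seen(steps=880, global_batch_size=2048)

print(f"demo run:   {demo_run:,} sequences")
print(f"longer run: {longer_run:,} sequences")
print(f"ratio:      {longer_run / demo_run:.0f}x")
```

The longer run processes several hundred times more sequences than the 30-step demonstration, which is consistent with the much lower validation loss in Figures 7 and 8.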

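Conclusion

Pruning and distillation represent a significant advancement in language model optimization. The ability to create smaller, more efficient models such as Llama-3.1-Minitron-4B for resource-constrained environments, while preserving performance and without sacrificing substantial accuracy, is a game changer for the AI industry.

The Mistral-NeMo-Minitron-8B model was developed using this approach and outperforms the Llama-3.1-8B model on a variety of benchmarks.

This approach reduces computational cost and energy consumption at inference time, and it also democratizes access to advanced NLP capabilities. This could revolutionize real-world applications on mobile devices, in edge computing, and in resource-constrained settings. As these techniques continue to evolve, you can expect to see even more compact yet powerful language models, further expanding the reach of this technology across industries.

For more information, see the following resources:

- Jupyter notebooks with pruning and distillation recipes
- Compact Language Models via Pruning and Knowledge Distillation research paper
- LLM Pruning and Distillation in Practice: The Minitron Approach, with a discussion of performance metrics
- How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model post, which introduces good practices for pruning and distillation techniques
- Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy post, which presents performance benchmarks of the Mistral-NeMo-Minitron 8B model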
