在過去的幾年里，世代人工智能吸引了公眾的注意力和想象力。從給定的自然語言提示，這些生成模型能夠生成人類質量的結果，從清晰表達的兒童故事到產品原型可視化。

大型語言模型（ LLM ）是這場革命的中心。 LLM 是一種通用的語言理解器，它將人類知識編纂成法典，可以很容易地應用于許多自然語言和編程語言理解任務，開箱即用。其中包括摘要、翻譯、問題回答以及代碼注釋和完成。

單個基礎語言模型完成許多任務的能力開辟了一個全新的人工智能軟件范式，其中單個基礎模型可以用于滿足公司所有部門的多個下游語言任務。這簡化并降低了人工智能軟件開發、部署和維護的成本。

創建自定義大型語言模型簡介

盡管 LLM 強大且前景光明，但通過針對特定用例的零樣本或少量快照學習，與 LLM 現成的性能仍存在差距。特別是，零樣本學習性能往往很低且不可靠。另一方面，很少有鏡頭學習依賴于找到最佳的離散提示，這是一個不平凡的過程。

如 GPT Understands, Too 中所述，用于解決下游問題的提示模板的微小變化可能會對最終精度產生重大影響。此外，由于提示更大，少鏡頭推理的成本也更高。

已經提出了參數有效的微調技術來解決這個問題。即時學習就是這樣一種技術，它將虛擬提示令牌附加到請求中。這些虛擬令牌是可學習的參數，可以使用標準優化方法進行優化，而 LLM 參數是凍結的。

本文介紹了使用 NVIDIA NeMo 定制 LLM 的過程，這是一個用于訓練、定制和部署基礎模型的通用框架。

什么是 NVIDIA NeMo ？

NVIDIA NeMo 是用于訓練、定制和部署大型基礎模型的通用框架。 NeMo 利用各種并行技術來加速訓練和推理，可以部署在用戶首選云、本地和邊緣系統上的多節點、多 GPU 系統上。要了解更多信息，請參閱 NVIDIA AI Platform Delivers Big Gains for Large Language Models 和 Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server 。

NeMo 生態系統由以下主要組成部分組成：

NVIDIA NeMo service ：通過 NVIDIA 管理的服務，為 LLM 的產品化提供快速途徑。開發人員可以利用 LLM 功能快速輕松地開發企業人工智能應用程序，而無需擔心底層基礎設施。您還可以通過云 API 或網絡游樂場界面體驗最大的語言模型之一 Megatron 530B 。目前處于早期訪問狀態。
NVIDIA NeMo framework ：一個端到端的容器化框架，允許開發人員高效地訓練和部署具有數十億和數萬億參數的語言模型，在數千 GPU 秒內提供高訓練效率。 NeMo 框架容器目前位于 open beta 中，可通過 NGC 獲得。
NVIDIA/NeMo ：為研究語音人工智能和 NLP （包括 LLM ）的研究人員構建的開源對話式人工智能工具包。可通過 GitHub 獲得。
NeMo 模型： NVIDIA 最近開放了源代碼的預訓練 NeMo 框架模型，從 1.3B GPT-3 、 5B GPT-3 和 3B mT5 model 等小型模型到 20B GPT-3 等大型模型。
NVIDIA/FasterTransformer ：一個開源工具包，用于通過 GitHub 進行 LLM 的高性能推理。要了解有關如何使用 Faster transformer 部署公共 NeMo 框架模型的更多信息，請參閱 Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron 。

這篇文章解釋了如何使用 NeMo 框架容器通過即時學習技術自定義公共 NeMo 模型。

使用 NeMo 快速學習

Prompt learning 統稱為兩參數高效微調技術，如下所述。有關更多信息，請參閱 Adapting P-Tuning to Solve Non-English Downstream Tasks 。

在提示調諧中，軟提示嵌入被初始化為 2D 矩陣。每個任務都有自己的 2D 嵌入矩陣。任務在訓練或推理過程中不共享任何參數。所有 LLM 參數都被凍結，并且在訓練期間僅更新每個任務的嵌入參數。 NeMo 提示調諧實現基于 The Power of Scale for Parameter-Efficient Prompt Tuning 。
在 p 調諧中， LSTM 模型或“提示編碼器”用于預測虛擬令牌嵌入。 LSTM 參數在 p 調諧開始時被隨機初始化。所有 LLM 參數都被凍結，并且在每個訓練步驟僅更新 LSTM 權重。 LSTM 參數在同時 p 調諧的所有任務之間共享，但 LSTM 模型為每個任務輸出唯一的虛擬令牌嵌入。 NeMo p 調諧實現基于 GPT Understands, Too 。

本例的即時學習使用 NeMo 生態系統的兩個開源組件： NeMo OSS 工具包和公共 NeMo 模型。

GitHub 上的 NeMo Multitask Prompt and P-Tuning 教程詳細介紹了在小型 GPT-3 345M 參數模型上進行提示學習的過程。本教程演示了即時學習的端到端過程：下載和預處理數據、下載模型、訓練即時學習模型，以及在三個不同的應用程序上進行推理。

下面的部分首先瀏覽筆記本，同時總結主要概念。然后，這個筆記本將被擴展到對更大的 NeMo 模型進行即時學習。

先決條件

您可以通過 NeMo Docker 容器體驗 NeMo 。這為 NeMo 的實驗提供了一個自給自足和可再生的環境。 NeMo Multitask Prompt and P-Tuning 教程使用 NeMo 22.09 容器進行了測試，但您可以嘗試相同容器的后續版本。使用以下腳本下載并運行此容器：

docker run  -u $(id -u ${USER}):$(id -g ${USER}) --rm -it --net=host nvcr.io/nvidia/nemo:22.09 bash

然后從容器交互式 bash 環境中啟動 Jupyter 實驗室：

cd /workspace
jupyter lab --ip 0.0.0.0 --allow-root --port=8888

在 Jupyter 實驗室，您可以在/ workspace / NeMo / tutorial / nlp / Multitask _ Pompt _ and _ PTuning.ipynb 下找到 NeMo 示例，包括上述筆記本。

此外，您需要一個 GPU 來處理較小的 5B 和 1.3B GPT-3 模型，需要四個 NVIDIA Ampere architecture 或 NVIDIA Hopper architecture GPU 用于處理 20B 模型，因為它具有四個張量平行度（ TP ）。

數據準備

筆記本將引導您完成三種不同應用程序的數據收集和預處理過程： Financial PhraseBank dataset 用于情緒分析任務， SQuAD dataset 用于問答任務， Assistant Benchmarking dataset 用于意圖和時段分類任務。

數據集應為. jsonl 格式，其中包含一組 JSON 對象。每個 JSON 對象必須包括字段任務名稱，這是數據示例所對應任務的字符串標識符。每個 JSON 對象還應包括一個或多個字段，這些字段對應于離散文本提示的不同部分。示例見圖 1 。

Screenshot of dataset format for NVIDIA NeMo prompt learning. — *圖 1 。 NeMo 即時學習的數據集格式*

提示模板

在形成提示時，您應該確定并遵守一個模式。這種模式被稱為 prompt template ，并根據使用情況而變化。情緒分析的示例如下所示。

{
        "taskname": "sentiment",
        "prompt_template": "<|VIRTUAL_PROMPT_0|> {sentence} sentiment:{label}",
        "total_virtual_tokens": 10,
        "virtual_token_splits": [10],
        "truncate_field": None,
        "answer_only_loss": True,
        "answer_field": "label",
    }

提示包含開頭的所有 10 個虛擬標記，然后是要分類的目標句子。接下來是一個文本標記（“sentiment:”），最后是用于訓練的句子的標簽。訓練數據 JSON 對象中的相應字段將映射到此提示模板，以形成完整的訓練示例。 NeMo 支持修剪特定字段以滿足模型令牌長度限制（使用 HuggingFace GPT-2 令牌化器的 NeMo 公共模型通常為 2048 個令牌）。

訓練

默認的 NeMo 提示調優配置在 yaml 文件中提供，可通過 GitHub 上的 NVIDIA/NeMo 獲得。筆記本加載這個 yaml 文件，然后覆蓋訓練選項以適應 345M GPT 模型。 NeMo p 調諧使得能夠同時學習多個任務。 NeMo 利用 PyTorch Lightning 接口，因此只需調用trainer.fit(model)語句即可完成訓練。

推論

最后，一旦經過訓練，模型就可以通過調用model.generate(inputs=test_examples)語句來用于對新樣本的推理（省略“answer_field”）。

快速學習大型模型

筆記本電腦中演示的 345M GPT-3 模型過程可以應用于更大的公共 NeMo GPT-3 型號，最多 1.3B GPT-3 和 5B GPT-3 。這種尺寸的型號只需要一個足夠內存容量的 GPU ，例如 NVIDIA V100 、 NVIDIA A100 和 NVIDIA H100 。下載模型后，替換模型名稱；特別是在以下單元格中：

# Download the model from NGC
gpt_file_name = "megatron_gpt_345m.nemo"
!wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/megatron_gpt_345m/versions/1/files/megatron_gpt_345m.nemo -

不要從 NGC 下載 345M GPT 模型，而是按照 HuggingFace 上的說明下載 1.3B GPT-3 或 5B GPT-3 模型，然后將gpt_file_name變量指向。 NeMo 模型文件。

請注意，對于 5B 型號，有兩種變體，一種是 TP 度為 1 （ nemo_gpt5B_fp16_tp1.nemo ），另一種是 TP = 2 （ nemo_gpt5B_fp16_tp2.nemo, nemo_gpt5B_bf16_tp2.nemo ) ）。筆記本電腦只能支持 TP = 1 變體。在其他一切不變的情況下，您可以端到端執行同一筆記本電腦。

多 – GPU 即時學習

由于 Jupyter 筆記本環境的限制，即時學習筆記本僅支持單次 – GPU 訓練。針對更大的模型利用多 GPU 訓練，具有更高程度的 TP （例如 20B GPT-3 為 4 ， 5B GPT-3 為其他變體為 2 ）需要使用不同的 NeMo prompt learning script 。此腳本受 config文件在這里可以找到許多參數的默認值。

模型

本節演示了在作為提示學習筆記本一部分下載并預處理的輔助數據集上使用多個 GPU 對大型模型進行提示學習的過程。

您可以下載 TP = 2 的 5B GPT 型號（ nemo_gpt5B_fp16_tp2.nemo) ）或 TP = 4 的 20B GPT-3 型號。請注意，這些模型存儲在中。 NeMo 壓縮存檔。要大幅加快模型加載速度，請提前解壓縮模型，并在 NeMo 配置中使用此解壓縮的文件夾。使用以下腳本：

tar -xvf nemo_gpt5B_fp16_tp2.nemo -C nemo_gpt5B_fp16_tp2.nemo.extracted

然后使用nemo_gpt5B_fp16_tp2.nemo.extracted NeMo 中提取的目錄nemo_gpt5B_fp16_tp2.nemo.extracted。

配置

適用于輔助數據集（意圖和插槽檢測應用程序）的配置文件如下所示：

name: megatron_virtual_prompt_gpt

trainer:
  devices: 2
  accelerator: gpu
  num_nodes: 1
  precision: 16
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  replace_sampler_ddp: False
  max_epochs: 25 # min 25 recommended
  max_steps: -1 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10 # frequency with which training steps are logged 
  val_check_interval: 1.0 # If is an int n > 1, will run val every n training steps, if a float 0.0 - 1.0 will run val every epoch fraction, e.g. 0.25 will run val every quarter epoch
  gradient_clip_val: 1.0
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
  benchmark: False


exp_manager:
  explicit_log_dir: null
  exp_dir: null
  name: ${name}
  create_wandb_logger: False
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: True
  resume_ignore_no_checkpoint: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    monitor: val_loss
    save_top_k: 2
    mode: min
    save_nemo_on_train_end: False # Should be false, correct prompt learning model file is saved at model.nemo_path set below, 
    filename: 'megatron_gpt_prompt_tune--{val_loss:.3f}-{step}'
    model_parallel_size: ${model.tensor_model_parallel_size}
    save_best_model: True

model:
  seed: 1234
  nemo_path: ${name}.nemo # .nemo filename/absolute path to where the virtual prompt model parameters will be saved
  virtual_prompt_style: 'p-tuning' # one of 'prompt-tuning', 'p-tuning', or 'inference'
  tensor_model_parallel_size: 1 # intra-layer model parallelism
  pipeline_model_parallel_size: 1 # inter-layer model parallelism
  global_batch_size: 8
  micro_batch_size: 4

  restore_path: null # Path to an existing p-tuned/prompt tuned .nemo model you wish to add new tasks to or run inference with
  language_model_path: ??? # Path to the GPT language model .nemo file, always required
  save_nemo_on_validation_end: True # Saves an inference ready .nemo file every time a checkpoint is saved during training. 
  existing_tasks: [] # List of tasks the model has already been p-tuned/prompt-tuned for, needed when a restore path is given
  new_tasks: ['intent_and_slot'] # List of new tasknames to be prompt-tuned
  


  ## Sequence Parallelism
  # Makes tensor parallelism more memory efficient for LLMs (20B+) by parallelizing layer norms and dropout sequentially
  # See Reducing Activation Recomputation in Large Transformer Models: https://arxiv.org/abs/2205.05198 for more details.
  sequence_parallel: False

  ## Activation Checkpoint 
  activations_checkpoint_granularity: null # 'selective' or 'full' 
  activations_checkpoint_method: null # 'uniform', 'block', not used with 'selective'
  # 'uniform' divides the total number of transformer layers and checkpoints the input activation
  # of each chunk at the specified granularity
  # 'block' checkpoints the specified number of layers per pipeline stage at the specified granularity
  activations_checkpoint_num_layers: null # not used with 'selective'

  task_templates: # Add more/replace tasks as needed, these are just examples
  - taskname: "intent_and_slot"    
    prompt_template: "<|VIRTUAL_PROMPT_0|>Predict intent and slot: {utterance} \nLabel:{label}"
    total_virtual_tokens: 10
    virtual_token_splits: [10]
    truncate_field: null
    answer_only_loss: False
    "answer_field": "label"


  prompt_tuning: # Prompt tunin specific params
    new_prompt_init_methods: ['text'] # List of 'text' or 'random', should correspond to tasks listed in new tasks
    new_prompt_init_text: ['some init text goes here'] # some init text if init method is text, or None if init method is random

  p_tuning: # P-tuning specific params
    encoder_type: "tpmlp" # ['tpmlp', 'lstm', 'biglstm', 'mlp'] 
    dropout: 0.0
    num_layers: 2  # number of layers for MLP or LSTM layers. Note, it has no effect for tpmlp currently as it always assumes it is two layers.
    encoder_hidden: 2048 # encoder hidden for biglstm and tpmlp
    init_std: 0.023  # init std for tpmlp layers

  data:
    train_ds: ???
    validation_ds: ???
    add_eos: True
    shuffle: True
    num_workers: 8
    pin_memory: True
    train_cache_data_path: null  # the path to the train cache data 
    validation_cache_data_path: null  # the path to the validation cache data 
    test_cache_data_path: null  # the path to the test cache data 
    load_cache: False  # whether to load from the cache data


  optim:
    name: fused_adam
    lr: 1e-4
    weight_decay: 0.01 
    betas: 
    - 0.9
    - 0.98
    sched:
      name: CosineAnnealing
      warmup_steps: 50
      min_lr: 0.0 # min_lr must be 0.0 for prompt learning when pipeline parallel > 1
      constant_steps: 0 # Constant steps should also be 0 when min_lr=0
      monitor: val_loss
      reduce_on_plateau: false

得益于 yaml 文本格式和注釋，大多數超參數都是不言自明的。使用 Jupyter 實驗室界面，創建一個包含此內容的文件，并將其保存在/workspace/nemo/examples/nlp/language_modeling/conf/megatron_gpt_prompt_learning_intent_n_slot.yaml下。

config文件中最重要的是如下所示的提示模板：

 prompt_template: "<|VIRTUAL_PROMPT_0|>Predict intent and slot: {utterance} \nLabel:{label}"
    total_virtual_tokens: 10
    virtual_token_splits: [10]
    truncate_field: null

這里， 10 個虛擬提示令牌與一些永久文本標記一起使用。

訓練

要開始培訓，請在 Jupyter 實驗室界面中打開一個終端窗口（文件→ 新建→ 終端）。然后發出 bash 命令：

python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_prompt_learning.py \
    	--config-name=megatron_gpt_prompt_learning_intent_n_slot.yaml \
    	trainer.devices=2 \
    	trainer.num_nodes=1 \
    	trainer.max_epochs=25 \
    	trainer.precision=bf16 \
    	model.language_model_path=/workspace/nemo/tutorials/nlp/nemo-megatron-gpt-5B/nemo_gpt5B_fp16_tp2.nemo.extracted \
    	model.nemo_path=/workspace/nemo/examples/nlp/language_modeling/intent_n_slot.nemo \
    	model.tensor_model_parallel_size=2 \
    	model.pipeline_model_parallel_size=1 \
    	model.global_batch_size=16 \
    	model.micro_batch_size=1 \
    	model.optim.lr=1e-4 \
    	model.data.train_ds=[/workspace/nemo/tutorials/nlp/data/assistant/assistant_train.jsonl] \
    	model.data.validation_ds=[/workspace/nemo/tutorials/nlp/data/assistant/assistant_val.jsonl]

請注意以下內容：

對于 5B GPT 模型（ nemo_gpt5B_fp16_tp2.nemo) ），model.tensor_model_parallel_size應設置為 2 ，對于 20B GPT-3 模型，應設置為 4
trainer.devices應設置為 TP 值的倍數。如果 5B 模型為 4 ，則將有兩個數據并行工作者，每個工作者有兩個 GPU
model.language_model_path應設置為模型提取目錄的絕對路徑
model.data.train_ds、model.data.validation_ds應設置為列車位置和驗證數據

推論

最后，經過訓練后，使用以下腳本在 NeMo 中進行推理：

python /workspace/nemo/examples/nlp/language_modeling/megatron_gpt_prompt_learning_eval.py \
            virtual_prompt_model_file=/workspace/nemo/examples/nlp/language_modeling/intent_n_slot.nemo \
            gpt_model_file=/workspace/nemo/tutorials/nlp/nemo-megatron-gpt-5B/nemo_gpt5B_fp16_tp2.nemo.extracted  \
            inference.greedy=True \
            inference.add_BOS=False \
            inference.tokens_to_generate=128 \
            trainer.devices=2 \
            trainer.num_nodes=1 \
            tensor_model_parallel_size=2 \
            pipeline_model_parallel_size=1 \
            data_paths=["/workspace/nemo/tutorials/nlp/data/assistant/assistant_test.jsonl"] \
            pred_file_path="test-results.txt"

請注意以下內容：

對于 5B GPT 模型（ nemo_gpt5B_fp16_tp2.nemo) ），model.tensor_model_parallel_size應設置為 2 ，對于 20B GPT-3 模型，應設置為 4
trainer.devices應設置為等于 TP 值（如上）
pred_file_path是記錄測試結果的文件，每個測試樣本一行

開始自定義您的語言模型

本文介紹了使用 NVIDIA NeMo 進行語言模型定制的過程。通過一個參數高效、計算高效的過程，這些模型可以從單個公共檢查點適用于許多 NLP 應用程序。

訪問 GitHub 上的 NVIDIA/NeMo ，開始 LLM 定制。您也可以注冊 NVIDIA NeMo service 早期訪問程序。

Register for GTC 2023 for free ，并于 3 月 20 日至 23 日加入我們的 How to Build Generative AI for Enterprise Use Cases 和 generative AI models for science and art 、 autonomous vehicles 、 content creation, and much more 的目標會話曲目。

如何創建自定義語言模型

創建自定義大型語言模型簡介

什么是 NVIDIA NeMo ？

使用 NeMo 快速學習

先決條件

數據準備

提示模板

訓練

推論

快速學習大型模型

多 – GPU 即時學習

模型

配置

訓練

推論

開始自定義您的語言模型

相關資源

標簽

關于作者

如何創建自定義語言模型

創建自定義大型語言模型簡介

什么是 NVIDIA NeMo ？

使用 NeMo 快速學習

先決條件

數據準備

提示模板

訓練

推論

快速學習大型模型

多 – GPU 即時學習

模型

配置

訓練

推論

開始自定義您的語言模型

相關資源

標簽

關于作者

相關文章

選擇大型語言模型定制技術

采用 P-Tuning 解決非英語下游任務

相關文章

使用 NVIDIA Isaac ROS 開發人員預覽版 3 構建高性能機器人應用程序

NVIDIA DGX 云與 Oracle 云基礎架構上的高性能存儲

GROMACS 2023 中的 CUDA 圖指南

利用三維合成數據進行自舉目標檢測模型訓練

使用 NVIDIA Nsight 系統加速數據中心和 HPC 性能分析