Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron

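Large language models (LLMs) are some of the most advanced deep learning algorithms capable of understanding written language. Many modern LLMs are built using the transformer network that Google introduced in 2017 in the Attention Is All You Need research paper.

NVIDIA NeMo Megatron is an end-to-end GPU-accelerated framework for training and deploying transformer-based LLMs with up to a trillion parameters. In September 2022, NVIDIA announced that NeMo Megatron is now available in Open Beta, allowing you to train and deploy LLMs using your own data. With this announcement, several pretrained checkpoints have been uploaded to HuggingFace, enabling anyone to deploy LLMs locally using GPUs.

This post walks you through downloading, optimizing, and deploying a 1.3 billion parameter GPT-3 model using NeMo Megatron. It includes NVIDIA Triton Inference Server, a powerful open-source inference-serving software that can deploy a wide variety of models and serve inference requests on both CPUs and GPUs in a scalable manner.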

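System requirements

While training LLMs requires massive amounts of compute power, trained models can be deployed for inference at a much smaller scale for most use cases.

The models from HuggingFace can be deployed on a local machine with the following specifications:

A modern Linux OS (tested with Ubuntu 20.04).
An NVIDIA Ampere architecture GPU or newer with at least 8 GB of GPU memory.
At least 16 GB of system memory.
Docker version 19.03 or newer with the NVIDIA Container Runtime.
Python 3.7 or newer with PIP.
A reliable internet connection for downloading models.
A firewall that allows incoming connections, if serving inference requests from remote machines.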
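The numeric thresholds in the list above can be captured in a small check. The following is a minimal sketch, not part of NeMo Megatron: `HostSpecs` and `unmet_requirements` are hypothetical names, and the sketch assumes you collect the values yourself (for example from `nvidia-smi` and `docker --version`). It does not check the GPU architecture, the operating system, or the firewall.

```python
import sys
from dataclasses import dataclass


@dataclass
class HostSpecs:
    """Hypothetical container for values gathered from the host by hand."""
    gpu_memory_gb: float
    system_memory_gb: float
    docker_version: tuple  # for example (19, 3) for Docker 19.03
    python_version: tuple  # for example (3, 7)


def unmet_requirements(specs):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if specs.gpu_memory_gb < 8:
        problems.append("GPU memory below 8 GB")
    if specs.system_memory_gb < 16:
        problems.append("system memory below 16 GB")
    if specs.docker_version < (19, 3):
        problems.append("Docker older than 19.03")
    if specs.python_version < (3, 7):
        problems.append("Python older than 3.7")
    return problems


if __name__ == "__main__":
    # Example values only; replace them with the numbers from your machine.
    specs = HostSpecs(8.0, 16.0, (19, 3), tuple(sys.version_info[:2]))
    print(unmet_requirements(specs) or "all checks pass")
```

Returning a list of problems rather than a single boolean makes it easier to see which requirement to fix first.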

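Preparation

NeMo Megatron is now in Open Beta and available to anyone who completes the free registration form. Registration is required to gain access to the training and inference containers, as well as the helper scripts used to convert and deploy trained models.

Several trained NeMo Megatron models are hosted publicly on HuggingFace, including 1.3B, 5B, and 20B GPT-3 models. These models have been converted to the .nemo format, which is optimized for inference.

Converted models cannot be retrained or fine-tuned, but they allow fully trained models to be deployed for inference. These models are significantly smaller than the pre-conversion checkpoints and are supported by FasterTransformer (FT). FasterTransformer is a backend in Triton Inference Server that runs LLMs across GPUs and nodes.

For the purposes of this post, we used the 1.3B model, which has the quickest inference speeds and can comfortably fit in the memory of most modern GPUs.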
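A rough calculation shows why the 1.3B model fits on a GPU with 8 GB of memory. The sketch below assumes FP16 weights at 2 bytes per parameter (the checkpoint file name used later in this post ends in `fp16`) and counts the weights only. Activations, the attention cache, and framework workspace need additional memory, so treat the result as a lower bound. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


if __name__ == "__main__":
    # Parameter counts of the GPT-3 models mentioned in this post.
    for name, params in [("1.3B", 1.3e9), ("5B", 5e9), ("20B", 20e9)]:
        print(f"{name}: about {weight_memory_gb(params):.1f} GB of FP16 weights")
```

Under these assumptions the 1.3B model needs about 2.6 GB for weights, while the 20B model needs about 40 GB and would have to be split across several GPUs or run on a larger one.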

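To convert the model, run the following steps.

Download the 1.3B model to your system. Run the following command in the directory where you want to save the converted models for NVIDIA Triton: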

wget https://huggingface.co/nvidia/nemo-megatron-gpt-1.3B/resolve/main/nemo_gpt1.3B_fp16.nemo

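Make a note of the folder to which the model was downloaded, as it is used throughout the remainder of this post.

Verify the MD5sum of the downloaded file: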

$ md5sum nemo_gpt1.3B_fp16.nemo
38f7afe7af0551c9c5838dcea4224f8a  nemo_gpt1.3B_fp16.nemo
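The same check can be scripted, which is convenient when the download is part of a larger setup script. This is a minimal sketch using only the Python standard library; `md5_of_file` and `checkpoint_is_intact` are hypothetical helper names, and the expected digest is the one shown above.

```python
import hashlib

# Digest listed in this post for nemo_gpt1.3B_fp16.nemo.
EXPECTED_MD5 = "38f7afe7af0551c9c5838dcea4224f8a"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a multi-GB checkpoint is never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checkpoint_is_intact(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `checkpoint_is_intact("nemo_gpt1.3B_fp16.nemo")` from the download directory should return `True` for a complete download. A mismatch usually means the transfer was interrupted and the file should be downloaded again.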

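Use a web browser to log in to NGC at ngc.nvidia.com. Open the Setup menu by selecting your account name. Select Get API Key followed by Generate API Key to create the token. Make a note of the key, as it is shown only once.

In the terminal, add the token to Docker: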

$ docker login nvcr.io
Username: $oauthtoken
Password: <insert token here>

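Replace <insert token here> with the token that was generated. The username must be exactly $oauthtoken, as this indicates that a personal access token is being used.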
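For unattended setups, the login can be done without an interactive prompt by using Docker's `--password-stdin` flag. The sketch below is one possible approach, not the procedure from the original walkthrough: it assumes the token is stored in an environment variable named `NGC_API_KEY`, which is a hypothetical name chosen for this example.

```python
import os
import subprocess


def docker_login_command(registry="nvcr.io"):
    """Build the argument list for a non-interactive `docker login`."""
    # "$oauthtoken" is a literal user name, not a shell variable. Passing it
    # as a list element (no shell) keeps it from being expanded.
    return ["docker", "login", registry,
            "--username", "$oauthtoken", "--password-stdin"]


def docker_login(token, registry="nvcr.io"):
    """Run the login and feed the token on standard input."""
    return subprocess.run(docker_login_command(registry),
                          input=token.encode(), check=True)


if __name__ == "__main__":
    token = os.environ.get("NGC_API_KEY")  # hypothetical variable name
    if token:
        docker_login(token)
    else:
        print("Set NGC_API_KEY to log in non-interactively.")
```

Feeding the token on standard input keeps it out of the shell history and the process list.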

Pull the latest training and inference images for NeMo Megatron:

$ docker pull nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3
$ docker pull nvcr.io/ea-bignlp/bignlp-inference:22.08-py3

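At the time of publication, the latest image tags are 22.08.01-py3 for training and 22.08-py3 for inference. We recommend checking NGC for newer tags and pulling those instead, if available.

Verify that the images were pulled successfully. The image IDs may differ from the ones shown here if you pulled different tags: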

$ docker images | grep "ea-bignlp/bignlp"
nvcr.io/ea-bignlp/bignlp-training                       22.08.01-py3                         d591b7488a47   11 days ago     17.3GB
nvcr.io/ea-bignlp/bignlp-inference                      22.08-py3                            77a6681df8d6   2 weeks ago     12.2GB
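If you script the setup, the listing above can be checked programmatically. The sketch below parses the text printed by `docker images` and reports which of the two images are missing. The function names are hypothetical, and the sample text is the output shown above rather than a live query.

```python
REQUIRED = {
    "nvcr.io/ea-bignlp/bignlp-training": "22.08.01-py3",
    "nvcr.io/ea-bignlp/bignlp-inference": "22.08-py3",
}

SAMPLE = """\
nvcr.io/ea-bignlp/bignlp-training                       22.08.01-py3                         d591b7488a47   11 days ago     17.3GB
nvcr.io/ea-bignlp/bignlp-inference                      22.08-py3                            77a6681df8d6   2 weeks ago     12.2GB
"""


def parse_docker_images(output):
    """Map each repository to the set of tags found in `docker images` output."""
    images = {}
    for line in output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to hold repository, tag, ID.
        if len(fields) < 3 or fields[0] == "REPOSITORY":
            continue
        images.setdefault(fields[0], set()).add(fields[1])
    return images


def missing_images(output, required=REQUIRED):
    """Return the required repository:tag pairs that are absent from the output."""
    found = parse_docker_images(output)
    return [f"{repo}:{tag}" for repo, tag in required.items()
            if tag not in found.get(repo, set())]


if __name__ == "__main__":
    print(missing_images(SAMPLE) or "both images are present")
```

If you pulled newer tags from NGC, update the values in `REQUIRED` to match.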

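Model conversion

To optimize the throughput and latency of the model, it can be converted to the FT format, which contains performance modifications to the encoder and decoder layers in the transformer architecture.

FT can serve inference requests with latencies that are 3x or more faster than those of non-FT counterparts. The NeMo Megatron training container includes the FT framework as well as scripts to convert a .nemo file to the FT format.

Triton Inference Server expects models to be stored in a model repository. A model repository contains checkpoints and model-specific information that Triton Inference Server reads to tune the model at deployment time. As with the FT framework, the NeMo Megatron training container includes scripts to convert the FT model to a model repository for Triton.

Converting the model to the FT format and creating a model repository for the converted model can be done in one pass in a Docker container. To create an FT-based model repository, run the following command. The items that you may need to change are described in the list after the command.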

docker run --rm \
    --gpus all \
    --shm-size=16GB \
    -v /path/to/checkpoints:/checkpoints \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3 \
    bash -c 'export PYTHONPATH=/opt/bignlp/FasterTransformer:${PYTHONPATH} && \
    cd /opt/bignlp && \
    python3 FasterTransformer/examples/pytorch/gpt/utils/nemo_ckpt_convert.py \
        --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo \
        --infer-gpu-num 1 \
        --saved-dir /model_repository/gpt3_1.3b \
        --weight-data-type fp16 \
        --load-checkpoints-to-cpu 0 && \
    python3 /opt/bignlp/bignlp-scripts/bignlp/collections/export_scripts/prepare_triton_model_config.py \
        --model-train-name gpt3_1.3b \
        --template-path /opt/bignlp/fastertransformer_backend/all_models/gpt/fastertransformer/config.pbtxt \
        --ft-checkpoint /model_repository/gpt3_1.3b/1-gpu \
        --config-path /model_repository/gpt3_1.3b/config.pbtxt \
        --max-batch-size 256 \
        --pipeline-model-parallel-size 1 \
        --tensor-model-parallel-size 1 \
        --data-type bf16'

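These steps launch a Docker container to run the conversions. The following list describes a few important parameters and their functions:

-v /path/to/checkpoints:/checkpoints: Specify the local directory where the checkpoints were saved. This is the directory mentioned in the checkpoint download step earlier. The final :/checkpoints directory in the command should stay the same.
-v /path/to/checkpoints/output:/model_repository: Specify the local directory to which the converted checkpoints are saved. Make a note of this location, as it is used in the deployment step later. The final :/model_repository directory in the command should stay the same.
nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3: If a newer image exists on NGC, replace the tag with the new version.
--in-file /checkpoints/nemo_gpt1.3B_fp16.nemo: The name of the downloaded checkpoint to convert. If you are using a different version, replace the name here.
--infer-gpu-num 1: The number of GPUs used for the deployed model. If using more than one GPU, increase this number to the desired amount. The remainder of this post assumes that the value 1 was used here.
--model-train-name gpt3_1.3b: The name of the deployed model. If you use a different model name, make a note of the new name, as NVIDIA Triton requests must specify it.
--tensor-model-parallel-size 1: If you are using a different GPU count for inference, this number must be updated. The value should match that of --infer-gpu-num from earlier.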
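Several of these parameters depend on each other: the GPU count appears in `--infer-gpu-num`, `--tensor-model-parallel-size`, and the `1-gpu` part of the `--ft-checkpoint` path, while the model name appears in three paths. The following sketch derives the linked values from two inputs so they cannot drift apart. `conversion_flags` is a hypothetical helper, and the `<N>-gpu` directory name for GPU counts other than 1 is an assumption extrapolated from the `1-gpu` path in the command above.

```python
def conversion_flags(model_name="gpt3_1.3b", gpu_count=1,
                     checkpoint="nemo_gpt1.3B_fp16.nemo"):
    """Return the linked flag values for the two conversion scripts."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    repo = f"/model_repository/{model_name}"
    convert = {
        "--in-file": f"/checkpoints/{checkpoint}",
        "--infer-gpu-num": str(gpu_count),
        "--saved-dir": repo,
    }
    triton_config = {
        "--model-train-name": model_name,
        # Assumption: the converter writes to "<N>-gpu" for N GPUs.
        "--ft-checkpoint": f"{repo}/{gpu_count}-gpu",
        "--config-path": f"{repo}/config.pbtxt",
        "--tensor-model-parallel-size": str(gpu_count),
    }
    return convert, triton_config


if __name__ == "__main__":
    for flags in conversion_flags():
        for flag, value in flags.items():
            print(flag, value)
```

With the default arguments, the printed values match the ones in the command above.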

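After running the command, verify that the model has been converted by viewing the specified output directory. The output should be similar to the following (truncated for brevity):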

$ ls -R output/
output/:
gpt3_1.3b

output/gpt3_1.3b:
1-gpu  config.pbtxt

output/gpt3_1.3b/1-gpu:
config.ini
merges.txt
model.final_layernorm.bias.bin
model.final_layernorm.weight.bin
...
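The same verification can be automated. The sketch below checks only the two entries that appear explicitly in the listing above, `config.pbtxt` and `1-gpu/config.ini`. It does not validate the weight files, and `missing_repository_entries` is a hypothetical helper name.

```python
from pathlib import Path


def missing_repository_entries(output_dir, model_name="gpt3_1.3b", gpu_count=1):
    """Return the expected files that are absent from the model repository."""
    model_dir = Path(output_dir) / model_name
    expected = [
        model_dir / "config.pbtxt",
        model_dir / f"{gpu_count}-gpu" / "config.ini",
    ]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "output" is the directory used in the listing above.
    missing = missing_repository_entries("output")
    print(missing or "model repository layout looks complete")
```

An empty result means both files exist. Any path in the returned list points to a step of the conversion that did not finish.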

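Model deployment

Now that the model has been converted to a model repository, it can be deployed with Triton Inference Server. Do this using the NeMo Megatron inference container, which has NVIDIA Triton built in.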

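By default, NVIDIA Triton uses three ports for HTTP, gRPC, and metric requests. The following command starts the server in a container and maps those ports to the host: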

docker run --rm \
    --name triton-inference-server \
    -d \
    --gpus all \
    -p 8000-8002:8000-8002 \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-inference:22.08-py3 \
    bash -c 'export CUDA_VISIBLE_DEVICES=0 && \
    tritonserver --model-repository /model_repository'

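The following list describes the important parameters in this command:

-d: This tells Docker to run the container in the background. The server remains online and available for requests until the container is killed.
-p 8000-8002:8000-8002: NVIDIA Triton uses port 8000 for HTTP requests, 8001 for gRPC requests, and 8002 for metrics information. These ports are mapped from the container to the host, allowing the host to receive requests directly and route them to the container.
-v /path/to/checkpoints/output:/model_repository: Specify the location on the machine where the converted checkpoints were saved. This should match the model repository location from the conversion step earlier.
nvcr.io/ea-bignlp/bignlp-inference:22.08-py3: If a newer version exists on NGC, replace the tag with the new version.
export CUDA_VISIBLE_DEVICES=0: Specify which devices to use. If the model was converted to use multiple GPUs earlier, this should be a comma-separated list of GPU IDs up to the desired number. For example, if you are using four GPUs, this should be CUDA_VISIBLE_DEVICES=0,1,2,3.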
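The `CUDA_VISIBLE_DEVICES` value follows directly from the GPU count chosen during conversion. A minimal sketch, assuming the GPUs to use are numbered contiguously from 0 as in the four-GPU example above (`cuda_visible_devices` is a hypothetical helper name):

```python
def cuda_visible_devices(gpu_count):
    """Build the comma-separated device list for the given number of GPUs."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return ",".join(str(index) for index in range(gpu_count))


if __name__ == "__main__":
    for count in (1, 4):
        print(f"export CUDA_VISIBLE_DEVICES={cuda_visible_devices(count)}")
```

If you want to skip a GPU, for example because another job is using it, write the list by hand instead.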

To verify that the container was launched successfully, run docker ps, which should show output similar to the following:

CONTAINER ID   IMAGE                                          COMMAND                  CREATED              STATUS              PORTS                                                           NAMES
f25cf23b75b7   nvcr.io/ea-bignlp/bignlp-inference:22.08-py3   "/opt/nvidia/nvidia_…"   About a minute ago   Up About a minute   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   triton-inference-server

Check the logs to see whether the model was deployed and is ready for requests (output truncated for brevity).

$ docker logs triton-inference-server
I0928 14:29:34.011299 1 server.cc:629] 
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| gpt3_1.3b | 1       | READY  |
+-----------+---------+--------+

I0928 14:29:34.131430 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0928 14:29:34.132280 1 tritonserver.cc:2176] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.24.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /model_repository                                                                                                                                                                            |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0928 14:29:34.133520 1 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I0928 14:29:34.133751 1 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I0928 14:29:34.174655 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

If the output is similar to what is shown here, the model is ready to receive inference requests.
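Besides reading the logs, readiness can be polled over HTTP. Triton implements the health endpoints of the KServe v2 inference protocol, so a ready server answers `GET /v2/health/ready` and `GET /v2/models/<name>/ready` with status 200. The sketch below uses only the standard library. The helper names are hypothetical, and the host and model name are the defaults used in this post.

```python
import urllib.error
import urllib.request


def readiness_urls(host="localhost:8000", model="gpt3_1.3b"):
    """Return the server-level and model-level readiness URLs."""
    base = f"http://{host}"
    return {
        "server": f"{base}/v2/health/ready",
        "model": f"{base}/v2/models/{model}/ready",
    }


def is_ready(url, timeout=5.0):
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all mean "not ready".
        return False
```

For example, `is_ready(readiness_urls()["model"])` should return `True` once the log shows the model as READY, and `False` while the container is still starting.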

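Sending inference requests

With a local Triton Inference Server running, you can start sending inference requests to the server. NVIDIA Triton's client API supports multiple languages, including Python, Java, and C++. For the purposes of this post, we provide a sample Python application.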

from argparse import ArgumentParser
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import GPT2Tokenizer

def fill_input(name, data):
    infer_input = httpclient.InferInput(name, data.shape, np_to_triton_dtype(data.dtype))
    infer_input.set_data_from_numpy(data)
    return infer_input

def build_request(query, host, output_len):
    # Send one tokenized prompt to Triton and return the generated token IDs.
    with httpclient.InferenceServerClient(host) as client:
        # Shape [1, num_tokens]: a batch holding a single tokenized prompt.
        input_ids = np.array([query]).astype(np.uint32)
        input_len = np.array([[len(query)]]).astype(np.uint32)
        request_output_len = np.array([[output_len]]).astype(np.uint32)
        # top_k=1 keeps only the most likely token at each step (greedy decoding).
        top_k = np.array([[1]]).astype(np.uint32)
        top_p = np.array([[0.0]]).astype(np.float32)
        temperature = np.array([[1.0]]).astype(np.float32)

        request_data = [
            fill_input('input_ids', input_ids),
            fill_input('input_lengths', input_len),
            fill_input('request_output_len', request_output_len),
            fill_input('runtime_top_k', top_k),
            fill_input('runtime_top_p', top_p),
            fill_input('temperature', temperature),
        ]
        # The model name must match --model-train-name from the conversion step.
        result = client.infer('gpt3_1.3b', request_data)
        return result.as_numpy('output_ids').squeeze()

def main():
    parser = ArgumentParser('Simple Triton Inference Requestor')
    parser.add_argument('query', type=str, help='Enter a text query to send to '
                        'the Triton Inference Server in quotes.')
    parser.add_argument('--output-length', type=int, help='Specify the desired '
                        'length for output.', default=30)
    parser.add_argument('--server', type=str, help='Specify the host:port that '
                        'Triton is listening on. Defaults to localhost:8000',
                        default='localhost:8000')
    args = parser.parse_args()

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    query = tokenizer(args.query).input_ids
    output_ids = build_request(query, args.server, args.output_length)
    print(tokenizer.decode(output_ids))

if __name__ == '__main__':
    main()

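At a high level, the script does the following:

Takes an input request from the user, such as "Hello there! How are you today?"
Tokenizes the input using a pretrained GPT-2 tokenizer from HuggingFace.
Builds an inference request using several required and optional parameters, such as the input tokens, temperature, and output length.
Sends the request to NVIDIA Triton.
Decodes the response using the tokenizer from earlier.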

To run the code, several Python dependencies are required. These packages can be installed by running the following command:

$ pip3 install numpy tritonclient[http] transformers

After the dependencies are installed, save the code to a local file and name it infer.py. Next, run the application as follows:

$ python3 infer.py "1 2 3 4 5 6"

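This sends the prompt "1 2 3 4 5 6" to the local inference server and should output the following, continuing the sequence up to the default response token limit of 30:

"1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36"

The server can now respond to any HTTP request that follows this basic pattern and can support multiple concurrent requests, both locally and remotely.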
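To make the shape of such an HTTP request concrete, the sketch below builds the JSON body defined by the KServe v2 inference protocol, which Triton's HTTP endpoint accepts at `POST /v2/models/<name>/infer`. The tensor names and values mirror the sample script. The helper names are hypothetical, the token IDs in the demo are placeholders rather than real tokenizer output, and whether this particular FT deployment accepts plain JSON tensors has not been verified here.

```python
import json


def tensor(name, datatype, rows):
    """Describe one 2-D input tensor, with its data flattened in row-major order."""
    return {
        "name": name,
        "shape": [len(rows), len(rows[0])],
        "datatype": datatype,
        "data": [value for row in rows for value in row],
    }


def build_payload(token_ids, output_len=30):
    """Build the request body for one tokenized prompt."""
    return {"inputs": [
        tensor("input_ids", "UINT32", [token_ids]),
        tensor("input_lengths", "UINT32", [[len(token_ids)]]),
        tensor("request_output_len", "UINT32", [[output_len]]),
        tensor("runtime_top_k", "UINT32", [[1]]),
        tensor("runtime_top_p", "FP32", [[0.0]]),
        tensor("temperature", "FP32", [[1.0]]),
    ]}


def infer_url(host="localhost:8000", model="gpt3_1.3b"):
    """Return the inference URL for the given host and model name."""
    return f"http://{host}/v2/models/{model}/infer"


if __name__ == "__main__":
    placeholder_ids = [101, 102, 103]  # placeholders, not real token IDs
    print(infer_url())
    print(json.dumps(build_payload(placeholder_ids), indent=2))
```

A client in any language can send this body with a `Content-Type: application/json` header, which is useful where the Triton client libraries are not available. The response still contains token IDs, so the caller needs the same tokenizer to decode them.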

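Summary

Large language models are powering a growing number of applications. With the public release of several NeMo Megatron models, it is now possible to deploy trained models locally.

This post outlined how to deploy public NeMo Megatron models using a simple Python script. You can test more robust models and use cases by downloading the larger models hosted on HuggingFace.

For more information about using NeMo Megatron, see the NeMo Megatron documentation and the NVIDIA/nemo GitHub repo.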
