高效的 CUDA 調試：將 NVIDIA Compute Sanitizer 與 NVIDIA 工具擴展程序結合使用并創建自定義工具

NVIDIA Compute Sanitizer 是一款功能強大的工具，可以節省時間和精力，同時提高 CUDA 應用程序的可靠性和性能。在 CUDA 環境中調試代碼既具有挑戰性又耗時，尤其是在處理數千個線程時。Compute Sanitizer 可以提供幫助！

在這一系列的第一篇文章中，高效 CUDA 調試：使用 NVIDIA Compute Sanitizer 追蹤錯誤中，我們將討論如何開始使用 Compute Sanitizer 工具，以檢查代碼中的內存泄漏和競爭條件。

在第二篇博文中，高效的 CUDA 調試：借助 NVIDIA Compute Sanitizer 實現內存初始化和線程同步。此外，我們還探討了用于檢查內存初始化和線程同步的工具。

在本文中，我們重點介紹了 Compute Sanitizer 的一些其他功能，即它與 NVIDIA 工具擴展程序 (NVTX) 的集成，用于標記代碼，以便更直接地使用 Compute Sanitizer.我們還討論了用于 Compute Sanitizer 的 API 本身，以創建更多用于調試 CUDA 應用的工具。

NVIDIA 計算 Sanitizer

Compute Sanitizer 是一套工具，可以對代碼的功能正確性執行不同類型的檢查。主要有四個工具：

memcheck：內存訪問錯誤和泄漏檢測。
racecheck：共享內存數據訪問的危險檢測工具。
initcheck：用于檢測未初始化的設備全局內存的工具。
synccheck：線程同步風險檢測。

除這些工具外， NVIDIA Compute Sanitizer 還具有更多功能：

用于創建針對 CUDA 應用的清理和追蹤工具的 API
與 NVIDIA 工具集成 (NVTX)
Coredump 功能用于 CUDA-GDB。
抑制功能，用于管理工具的輸出

將 Compute Sanitizer 與 NVTX 結合使用

NVTX 是一種基于 C 的 API，用于標注程序中的代碼范圍、事件和資源。此標注支持在應用程序運行時收集更多信息，這些信息可用于在分析和分析代碼時改進數據呈現。Compute Sanitizer 和 NVTX 之間的集成使您能夠使用 NVTX 標注代碼，以協助 Compute Sanitizer 捕獲錯誤。

有關 NVTX 標注的更多信息，請參閱以下文章：

C/C++ 和 NVTX：CUDA Pro 提示：使用 NVTX 生成自定義應用程序配置文件時間軸
Python 和 NVTX：NVIDIA 工具擴展程序 API：用于在 Python 和 C/C++ 中分析代碼的標注工具。
Fortran 和 NVTX：使用 NVTX 自定義 CUDA Fortran 分析以提高性能和效率。

我們的 NVTX 顯存 API 使 CUDA 程序能夠將內存限制（例如內存池管理或權限限制以及內存標記）通知 Compute Sanitizer。

內存池管理

NVTX 與 Compute Sanitizer 的第一個示例是 suballocation，它是 NVTX Memory API 的一部分。

通過 API 可以將內存分配標記為內存池。Compute Sanitizer 了解這些池，并可以檢測實際正在使用特定分配的哪些部分。然后，如果在代碼執行期間訪問了內存池的任何未注冊部分，Compute Sanitizer 通過其 `memcheck` 工具來檢測這些訪問。

這是基本內存池的示例，代碼名為 `mempool_example.cu`。

#include 
 
__global__ void populateMemory(int* chunk) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  chunk[i] = i;
}
 
int main(int argc, char **argv) {
  int poolSize   = 4096 * sizeof(int);
  int numThreads = 63;
  // int bucketSize = numThreads * sizeof(int); // You need this later ...
 
  void *pool;
  cudaMallocManaged(&pool, poolSize); // Create your memory pool
 
  // Assign part of the memory pool to the bucket
  auto bucket = (int *)pool + 16; // Address of bucket is 16 bytes into the pool
 
  // Set values in bucket
  populateMemory<<<1, numThreads>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 1: bucket 0, 1 ..  62: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads-1]);
 
  // Set some more values in bucket
  populateMemory<<<1, numThreads + 1>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 2: bucket 0, 1 ..  63: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads]);
 
  cudaFree(pool);
  exit(0);
}

代碼示例中，您創建了一個內存池（稱為 pool），大小為 4096 個整數。然后，您可以分配該池的一部分，并用變量來標記。`bucket` 變量指向內存池的 16 字節地址開始。

您的意圖是將 bucket 容器中的元素數量為 63。首先，您將元素數量存儲在 bucketSize 變量中。然后，使用 GPU 內核對數據桶進行填充。塊數為 1，線程數為 numThreads，這意味著 populateMemory 在 `bucket` 中執行 1 至 63 次，從而影響 `bucket` 中元素的數量。

盡管您嘗試在 `populateMemory` 內核中填寫 `bucket`，但由于您將線程數量設置為 `numThreads+1` (64)，這會導致一個額外的線程去處理 `bucket` 中的 63 個值。雖然您的意圖是確保 `bucket` 中有 63 個值，但分配 64 個值不會導致錯誤，因為實際的內存池 `bucket` 很大，足以容納額外的元素。

使用 `memcheck` 工具運行以確認潛在錯誤。我們在 NVIDIA V100 GPU 上運行，因此我們將 GPU 架構設置為 `sm_70`。您可能需要根據所運行的內容進行更改。

$ nvcc -o mempool.exe mempool_example.cu -arch=sm_70
$ ./mempool.exe
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63
 
$compute-sanitizer --tool memcheck ./mempool.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63
========= ERROR SUMMARY: 0 errors

NVTX API 提供以下功能以幫助管理內存分配：注冊任何 `cudaMalloc` 使用 NVTX 的內存堆寄存器功能進行內存分配。此操作將內存注冊為表示可進一步細分為區域的內存范圍的堆。以下代碼展示了如何執行此操作：

首先，完成將 NVTX 與 Compute Sanitizer 結合使用所需的四個步驟。

對于 C 和 C++，NVTX 是一個僅包含報文頭的庫，不依賴任何包。通常，這些報文頭隨您首選的 CUDA 下載提供，例如工具包或 HPC SDK。然而，NVTX Memory API 是相對較新的，現在可以從 /NVIDIA/NVTX GitHub 庫獲取。未來，它將被包含在工具包中。

特別注意，nvToolsExtMem.h中尚未提供其他方法的頭文件。因此，克隆 NVTX GitHub 分支后，請檢查是否存在 `dev-mem-api`。

$ git clone --branch dev-mem-api https://github.com/NVIDIA/NVTX.git
…
$ ls NVTX/c/include/nvtx3/
nvToolsExtCuda.h    nvToolsExt.h           nvToolsExtMem.h     nvToolsExtSync.h  nvtxDetail
nvToolsExtCudaRt.h  nvToolsExtMemCudaRt.h  nvToolsExtOpenCL.h  nvtx3.hpp         nvtxExtDetail

現在，您可以在源代碼開頭添加 NVTX 和 NVTX API 頭文件：

#include 
#include 

Compute Sanitizer 需要在任何 NVTX 調用之前初始化 CUDA 運行時。無論何時在代碼中開始使用 NVTX，這都會發生，具體取決于您開始使用 NVTX 的位置。您可以使用例如 `cudaFree` 的方法來實現。

// Forces CUDA runtime initialization.
cudaFree(0);

最后，創建 NVTX 域。這些是調用 API 所必需的。目前，這些域沒有特定的功能，但將用于未來的 Compute Sanitizer 版本。

// Create the NVTX domain
auto mynvtxDomain = nvtxDomainCreateA("my-domain");

好的，這是完成的第一步。現在，使用 NVTX 將池分配注冊為內存池或堆：

nvtxMemVirtualRangeDesc_t myPoolRangeDesc = {}; // Descriptor for the
                                                // range memory pool
myPoolRangeDesc.size = poolSize; // Size of the range memory pool
myPoolRangeDesc.ptr  = pool;     // Pointer to the pool itself
 
nvtxMemHeapDesc_t myHeapDesc = {}; // Descriptor for the heap
 
myHeapDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myHeapDesc.structSize = sizeof(nvtxMemHeapDesc_t);
myHeapDesc.usage = NVTX_MEM_HEAP_USAGE_TYPE_SUB_ALLOCATOR;
myHeapDesc.type = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
myHeapDesc.typeSpecificDescSize = sizeof(nvtxMemVirtualRangeDesc_t);
myHeapDesc.typeSpecificDesc = &myPoolRangeDesc
 
auto mynvtxPool = nvtxMemHeapRegister(mynvtxDomain, &myHeapDesc);

這些步驟已將池分配給變量 `mynvtxPool`。為了使用它之前的示例，您現在必須在池中創建二次分配以表示存儲桶。該語法與分配池本身的方式相同，但這次使用區域描述器而不是堆描述符：

nvtxMemVirtualRangeDesc_t mySubRangeDesc = {}; // Descriptor for the range
mySubRangeDesc.size = bucketSize; // Size of your suballocation (in bytes)
mySubRangeDesc.ptr  = bucket;     // Pointer to the suballocation
 
nvtxMemRegionsRegisterBatch_t myRegionsDesc = {};
myRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myRegionsDesc.structSize  = sizeof(nvtxMemRegionsRegisterBatch_t);
myRegionsDesc.regionType  = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
myRegionsDesc.heap = mynvtxPool; // The heap you registered earlier
myRegionsDesc.regionCount = 1;
myRegionsDesc.regionDescElementSize = sizeof(nvtxMemVirtualRangeDesc_t);
myRegionsDesc.regionDescElements = &mySubRangeDesc
 
nvtxMemRegionsRegister(mynvtxDomain, &myRegionsDesc);

這既是您的內存池，也是二次分配存儲桶，bucket 已在 NVTX 中注冊。這意味著 Compute Sanitizer 可以將其屬性納入其檢查。現在，請查看它是否選擇了錯誤的填充嘗試 bucket，其值超出預期范圍。

這是包含 NVTX 注冊的完整代碼示例，名為 `mempool_nvtx_example.cu`。

#include 
#include 
 
#include 
 
__global__ void populateMemory(int* chunk) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  chunk[i] = i;
}
 
int main(int argc, char **argv) {
  int poolSize   = 4096 * sizeof(int);
  int numThreads = 63;
  int bucketSize = numThreads * sizeof(int);
 
  // Forces CUDA runtime initialization.
  cudaFree(0);
 
  // Create the NVTX domain
  auto mynvtxDomain = nvtxDomainCreateA("my-domain");
 
 
  void *pool;
  cudaMallocManaged(&pool, poolSize); // Create your memory pool
 
  // Register the pool with NVTX
  nvtxMemVirtualRangeDesc_t myPoolRangeDesc = {}; // Descriptor for the
                                                  // range memory pool
  myPoolRangeDesc.size = poolSize; // Size of the range memory pool
  myPoolRangeDesc.ptr  = pool;     // Pointer to the pool itself
 
  nvtxMemHeapDesc_t myHeapDesc = {}; // Descriptor for the heap
 
  myHeapDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
  myHeapDesc.structSize = sizeof(nvtxMemHeapDesc_t);
  myHeapDesc.usage = NVTX_MEM_HEAP_USAGE_TYPE_SUB_ALLOCATOR;
  myHeapDesc.type = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
  myHeapDesc.typeSpecificDescSize = sizeof(nvtxMemVirtualRangeDesc_t);
  myHeapDesc.typeSpecificDesc = &myPoolRangeDesc
 
  auto mynvtxPool = nvtxMemHeapRegister(mynvtxDomain, &myHeapDesc);
 
  // Assign part of the memory pool to the bucket
  auto bucket = (int *)pool + 16; // Address of bucket is 16 bytes into the pool
 
  // Register bucket as a suballocated region in NVTX
  nvtxMemVirtualRangeDesc_t mySubRangeDesc = {}; // Descriptor for the range
  mySubRangeDesc.size = bucketSize; // Size of your suballocation (in bytes)
  mySubRangeDesc.ptr  = bucket;     // Pointer to the suballocation
 
  nvtxMemRegionsRegisterBatch_t myRegionsDesc = {};
  myRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
  myRegionsDesc.structSize  = sizeof(nvtxMemRegionsRegisterBatch_t);
  myRegionsDesc.regionType  = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
  myRegionsDesc.heap = mynvtxPool; // The heap you registered earlier
  myRegionsDesc.regionCount = 1;
  myRegionsDesc.regionDescElementSize = sizeof(nvtxMemVirtualRangeDesc_t);
  myRegionsDesc.regionDescElements = &mySubRangeDesc
 
  nvtxMemRegionsRegister(mynvtxDomain, &myRegionsDesc);
 
  // Set values in bucket
  populateMemory<<<1, numThreads>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 1: bucket 0, 1 ..  62: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads-1]);
 
  // Set some more values in bucket
  populateMemory<<<1, numThreads + 1>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 2: bucket 0, 1 ..  63: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads]);
 
  cudaFree(pool);
  exit(0);
}

編譯并再次通過 Compute Sanitizer 運行。編譯步驟中的 include 語句應指向 NVTX 頭文件的安裝位置。

$ nvcc -I ./NVTX/c/include -o mempool_nvtx.exe mempool_nvtx_example.cu -arch=sm_70
$ compute-sanitizer --tool memcheck --destroy-on-device-error=kernel ./mempool_nvtx.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
========= Invalid __global__ write of size 4 bytes
=========     at populateMemory(int *)+0x70
=========     by thread (63,0,0) in block (0,0,0)
=========     Address 0x7f2a9800013c is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f2a98000040 of size 252 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
. . .
=========
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 0
========= ERROR SUMMARY: 1 error

Compute Sanitizer 確實捕獲了寫入指定數組末端元素的嘗試：

Invalid __global__ write of size 4 bytes

現在，如果您有部分內存池要調整大小甚至銷毀，該怎么辦？NVTX 內存 API 還提供了對其 NVTX 注冊執行此操作的類似方法。

如需調整大小，請返回上一個池和存儲桶示例。如果您想將存儲桶的大小從 63 個元素調整為 64 個元素，請修改之前的代碼示例，并使用以下內容調整 NVTX 注冊存儲桶的大小，以反映這一點：

// Resizing the sub-allocation within the memory pool
 // You reuse mySubRangeDesc from earlier
 mySubRangeDesc.size = bucketSize + 4; // You want one extra int (4B) element
 mySubRangeDesc.ptr  = bucket;
 
 nvtxMemRegionsResizeBatch_t myNewRegionsDesc = {};
 myNewRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
 myNewRegionsDesc.structSize = sizeof(mySubRangeDesc);
 myNewRegionsDesc.regionType = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
 myNewRegionsDesc.regionDescCount = 1;
 myNewRegionsDesc.regionDescElementSize = sizeof(mySubRangeDesc);
 myNewRegionsDesc.regionDescElements = &mySubRangeDesc
 
 nvtxMemRegionsResize(mynvtxDomain, &myNewRegionsDesc);

正如您所見，它與二次分配的初始聲明類似，但使用了 `nvtxMemRegionsResize` 函數在最后

仔細檢查 Compute Sanitizer 的配置，以確保它對大小存儲桶分配值的合理調整。在兩次調用之間添加調整大小的注冊表代碼，例如在示例代碼中的 `populateMemory` 函數中進行編譯和運行。

$ nvcc -I./NVTX/c/include -o mempool_resize.exe mempool_resize_example.cu -arch=sm_70
$ compute-sanitizer --tool memcheck --destroy-on-device-error=kernel ./mempool_resize.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63
========= ERROR SUMMARY: 0 errors

希望您能看到類似的示例。在調整二次分配的注冊大小后，現在不會有人對嘗試訪問添加到最后的新元素提出任何投訴。

使用 NVTX 和 Compute Sanitizer 需要大量的代碼行來管理注冊池和調整池大小，以及二次分配。對于復雜的代碼，這可能會很繁瑣。為了簡化使用，可以將步驟封裝到一個單獨的類中。NVTX 和 Compute Sanitizer 庫提供了示例代碼，展示如何從這些類中獲取此類方法，這可以作為您自己代碼的起點。

此外還有兩個 NVTX API：

命名 API：允許區域或二次分配具有與其關聯的 ASCII 名稱。然后，它可用于在錯誤報告中按其名稱引用分配，目前支持此報告用于泄漏和未使用的內存報告。
訪問權限 API：用于限制分配的訪問權限，使其僅為只讀或原子訪問。

Compute Sanitizer API，用于創建您自己的工具

Compute Sanitizer 隨附 API，使您能夠針對 CUDA 應用創建自己的清理和追蹤工具。這是一組功能，您可以使用這些功能與 Compute Sanitizer 進行交互以進行控制和配置，啟用或禁用其功能，以及訪問其結果。

該 API 還為您提供了將 Compute Sanitizer 集成到開發工作流程的便捷方式，因為它可以輕松集成到現有的 CUDA 應用中。借助 Compute Sanitizer API，您可以直接利用強大的調試功能，提高 CUDA 應用的可靠性和性能。

它由以下子 API 組成：

回調：支持您在用戶代碼中注冊回調函數，這些回調函數與相關 CUDA 函數或事件相關聯，例如 `memcpy` 運算或驅動函數。訂閱者可以使用這些回調函數，例如用于事件跟蹤。
補丁：支持將補丁函數加載到在 GPU 上執行的設備代碼中。然后，它們可以用作儀器點，這意味著每當執行補丁事件時都會執行補丁函數，例如用于設置回調，例如進行內存訪問的設備代碼。
顯存：為標準 CUDA 內存 API 提供替代函數。可以在 Compute Sanitizer 回調函數中安全調用替代函數，例如使用 sanitizerAlloc() 而不是 cudaMalloc()。

這些 API 相結合，使您能夠將 Compute Sanitizer 功能整合到自己的工具中。

有關更多信息和一些示例代碼，請參閱 NVIDIA Compute Sanitizer API 指南。

結束語

使用 NVIDIA 計算 Sanitizer 立即下載 CUDA 工具包。

希望我們已經為您詳細介紹了 Compute Sanitizer 中的一些附加功能。有關更多信息，請參閱 /NVIDIA/compute-sanitizer-samples GitHub 庫和 NVIDIA Compute Sanitizer 用戶手冊。

這些近期的講座介紹了 Compute Sanitizer 中引入的一些更新功能：

從宏到微：CUDA 開發者工具在各種規模下查找并解決問題
調試 CUDA：CUDA 校正工具概述
探索用于自動駕駛汽車開發的高效工具

如果您需要支持，NVIDIA 開發者論壇是一個很好的起點。該論壇有專門的 Compute Sanitizer 論壇，專門針對 Compute Sanitizer 工具提供支持。

祝您在尋找錯誤時好運！

高效的 CUDA 調試：將 NVIDIA Compute Sanitizer 與 NVIDIA 工具擴展程序結合使用并創建自定義工具

NVIDIA 計算 Sanitizer

將 Compute Sanitizer 與 NVTX 結合使用

內存池管理

Compute Sanitizer API，用于創建您自己的工具

結束語

相關資源

標簽

關于作者

高效的 CUDA 調試：將 NVIDIA Compute Sanitizer 與 NVIDIA 工具擴展程序結合使用并創建自定義工具

NVIDIA 計算 Sanitizer

將 Compute Sanitizer 與 NVTX 結合使用

內存池管理

Compute Sanitizer API，用于創建您自己的工具

結束語

相關資源

標簽

關于作者

相關文章

高效 CUDA 調試：使用 NVIDIA Compute Sanitizer 進行內存初始化和線程同步

使用 NVIDIA Compute Sanitizer 更有效地調試 CUDA

相關文章

Kaggle 大師揭秘數據科學超能力的獲勝策略

構建應用程序以安全使用 KV 緩存

NVIDIA cuPyNumeric 25.03 現已完全開源，支持 PIP 和 HDF5

頂級大師專業提示：使用 NVIDIA cuDF-pandas 進行特征工程，在 Kaggle 競賽中奪冠

在大型語言模型時代，通過消息量化和流式傳輸實現高效的聯邦學習