Fault Handling

This topic sets expectations for application developers regarding the production use and behavior of Vulkan SC Fault Handling.

This Vulkan SC implementation expects user applications to monitor Vulkan SC Fault Handling through VkFaultCallbackInfo and vkGetFaultData, as described below.

Avoid Generating Faults

User applications must not cause Vulkan SC to report faults at VK_FAULT_LEVEL_CRITICAL during normal operation. When applications use the Vulkan SC API correctly, and NVIDIA DRIVE® OS SEooC and the NVIDIA DRIVE Orin™ SoC operate under normal conditions, this Vulkan SC implementation is designed to report zero faults through the Fault Handling interface introduced in the Khronos Vulkan SC Specification.

The rationale is that these reports are symptoms of detected faults. The fault handling in this Vulkan SC implementation reports faults only at the Quality Managed (QM) level of availability and automotive integrity. For VK_FAULT_LEVEL_CRITICAL in particular, the fault can indicate that the iGPU encountered an uncorrectable error.

A further reason to avoid this error condition is that the VkDevice that triggered the VK_FAULT_LEVEL_CRITICAL fault becomes lost, as described in the Vulkan SC Specification at 5.2.3. Lost Device. Vulkan SC API functions that submit to or wait on a VkQueue then return VK_ERROR_DEVICE_LOST. On Orin and the QNX Safety Platform, the system's error response causes other VkDevice objects, even in separate system processes, to become lost, and also causes CUDA devices to return errors. This degrades the availability of iGPU execution for the system (Guest VM). An independent ASIL software resource manager of NvGPU reports an asynchronous error to Safety Services through the Error Propagation Library.

During development and verification, if the implementation reports faults to an application, application developers should investigate and correct their usage of the Vulkan SC API. To assist in logging and debugging, the VK_NV_private_vendor_info extension supports a VkFaultDataDescriptionNV structure, which contains a description string.

Monitor SC Fault Handling

For automotive deployment, user applications must monitor the SC fault handling interface. The Vulkan SC Specification introduces this interface at 35.1. Fault Handling and identifies it as optional for applications, but this Vulkan SC implementation expects applications to monitor Fault Handling.

Specifically, this guide expects applications to exercise fault handling as follows:

Applications must register a callback function through VkFaultCallbackInfo when creating each VkDevice. Applications provide a function with the signature PFN_vkFaultCallbackFunction, assign its function pointer to VkFaultCallbackInfo::pfnFaultCallback, and add VkFaultCallbackInfo to the pNext chain of the VkDeviceCreateInfo passed to vkCreateDevice.
Applications must call vkGetFaultData for each VkDevice after completing all of their calls to functions in API Group: Init.
After an application calls vkQueueSubmit, it should call vkGetFaultData within one second on the VkDevice from which that VkQueue was obtained. The one-second interval is arbitrary. A single call to vkGetFaultData can report faults from multiple prior queue submissions.
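The callback registration described above might look like the following minimal sketch. It is illustrative only: so that it compiles without the Vulkan SC SDK, it declares local stand-ins for the few types it touches (assumed to mirror the Khronos header, with a placeholder sType value), and the names FaultTally, g_tally, on_fault, and make_fault_callback_info are hypothetical. Production code would include <vulkan/vulkan_sc.h> instead of the stand-ins and point VkDeviceCreateInfo::pNext at the returned structure before calling vkCreateDevice.

```c
#include <stddef.h>
#include <stdint.h>

/* ---- Stand-ins, ASSUMED to mirror <vulkan/vulkan_sc.h>. Delete this section
 * and include the real header in production code. ---- */
typedef uint32_t VkBool32;
typedef int VkStructureType;
enum { VK_STRUCTURE_TYPE_FAULT_CALLBACK_INFO = 0 /* placeholder value */ };
typedef enum VkFaultLevel {
    VK_FAULT_LEVEL_UNASSIGNED,
    VK_FAULT_LEVEL_CRITICAL,
    VK_FAULT_LEVEL_RECOVERABLE,
    VK_FAULT_LEVEL_WARNING
} VkFaultLevel;
typedef enum VkFaultType {
    VK_FAULT_TYPE_INVALID,
    VK_FAULT_TYPE_UNASSIGNED,
    VK_FAULT_TYPE_IMPLEMENTATION,
    VK_FAULT_TYPE_SYSTEM,
    VK_FAULT_TYPE_PHYSICAL_DEVICE,
    VK_FAULT_TYPE_COMMAND_BUFFER_FULL,
    VK_FAULT_TYPE_INVALID_API_USAGE
} VkFaultType;
typedef struct VkFaultData {
    VkStructureType sType;
    void *pNext;
    VkFaultLevel faultLevel;
    VkFaultType faultType;
} VkFaultData;
typedef void (*PFN_vkFaultCallbackFunction)(VkBool32 unrecordedFaults,
                                            uint32_t faultCount,
                                            const VkFaultData *pFaults);
typedef struct VkFaultCallbackInfo {
    VkStructureType sType;
    void *pNext;
    uint32_t faultCount;
    VkFaultData *pFaults;
    PFN_vkFaultCallbackFunction pfnFaultCallback;
} VkFaultCallbackInfo;
/* ---- End of stand-ins. ---- */

/* Hypothetical application-side counters, updated by the callback. */
typedef struct FaultTally {
    uint32_t critical;
    uint32_t recoverable;
    uint32_t other;      /* WARNING and UNASSIGNED levels */
    uint32_t unrecorded; /* callbacks that flagged unrecorded faults */
} FaultTally;

static FaultTally g_tally;

/* Callback with the PFN_vkFaultCallbackFunction parameter list. It only
 * counts; a real callback may also need synchronization and logging. */
static void on_fault(VkBool32 unrecordedFaults, uint32_t faultCount,
                     const VkFaultData *pFaults)
{
    if (unrecordedFaults) {
        g_tally.unrecorded++;
    }
    for (uint32_t i = 0; i < faultCount; ++i) {
        switch (pFaults[i].faultLevel) {
        case VK_FAULT_LEVEL_CRITICAL:    g_tally.critical++;    break;
        case VK_FAULT_LEVEL_RECOVERABLE: g_tally.recoverable++; break;
        default:                         g_tally.other++;       break;
        }
    }
}

/* Builds the structure to add to the VkDeviceCreateInfo pNext chain.
 * `storage` is caller-owned space for fault records; this sketch assumes it
 * must stay valid for the lifetime of the device. */
static VkFaultCallbackInfo make_fault_callback_info(VkFaultData *storage,
                                                    uint32_t capacity)
{
    VkFaultCallbackInfo info;
    info.sType = VK_STRUCTURE_TYPE_FAULT_CALLBACK_INFO;
    info.pNext = NULL;
    info.faultCount = capacity;
    info.pFaults = storage;
    info.pfnFaultCallback = on_fault;
    return info;
}
```

Keeping the callback to simple counting is a deliberate choice in this sketch: the application can inspect the tally from its own control loop and decide there how to react, rather than doing heavy work inside the callback. Consult the Vulkan SC Specification for the constraints on faultCount and pFaults.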

As rationale, many functions in the Vulkan SC API return void rather than a VkResult return code. This interface design supports efficient execution on the CCPLEX with a low branch count for application code paths that execute frequently, such as recording command buffers. If the Vulkan SC implementation detects faults and the void-returning functions do not execute normally, SC fault handling reports to the application, on a best-effort basis, that one or more faults occurred.
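Because void-returning functions cannot signal failure directly, an application can poll vkGetFaultData after recording or submitting work. The following is a minimal sketch under stated assumptions, not a reference implementation: the stand-in declarations are assumed to mirror <vulkan/vulkan_sc.h> (with placeholder values) so that the code compiles without the SDK, and MAX_POLLED_FAULTS, PollResult, poll_device_faults, and fake_get_fault_data are hypothetical names. The entry point is passed as a function pointer so that the sketch can run against a test double instead of a driver.

```c
#include <stddef.h>
#include <stdint.h>

/* ---- Stand-ins, ASSUMED to mirror <vulkan/vulkan_sc.h>. Delete this section
 * and include the real header in production code. ---- */
typedef uint32_t VkBool32;
typedef int VkStructureType;
enum { VK_STRUCTURE_TYPE_FAULT_DATA = 0 /* placeholder value */ };
typedef enum VkResult { VK_SUCCESS = 0 } VkResult;
typedef struct VkDevice_T *VkDevice;
typedef enum VkFaultLevel {
    VK_FAULT_LEVEL_UNASSIGNED,
    VK_FAULT_LEVEL_CRITICAL,
    VK_FAULT_LEVEL_RECOVERABLE,
    VK_FAULT_LEVEL_WARNING
} VkFaultLevel;
typedef enum VkFaultType {
    VK_FAULT_TYPE_INVALID,
    VK_FAULT_TYPE_UNASSIGNED,
    VK_FAULT_TYPE_IMPLEMENTATION,
    VK_FAULT_TYPE_SYSTEM,
    VK_FAULT_TYPE_PHYSICAL_DEVICE,
    VK_FAULT_TYPE_COMMAND_BUFFER_FULL,
    VK_FAULT_TYPE_INVALID_API_USAGE
} VkFaultType;
typedef enum VkFaultQueryBehavior {
    VK_FAULT_QUERY_BEHAVIOR_GET_AND_CLEAR_ALL_FAULTS
} VkFaultQueryBehavior;
typedef struct VkFaultData {
    VkStructureType sType;
    void *pNext;
    VkFaultLevel faultLevel;
    VkFaultType faultType;
} VkFaultData;
typedef VkResult (*PFN_vkGetFaultData)(VkDevice, VkFaultQueryBehavior,
                                       VkBool32 *, uint32_t *, VkFaultData *);
/* ---- End of stand-ins. ---- */

/* Hypothetical capacity; assumed not to exceed the device's query limit. */
#define MAX_POLLED_FAULTS 8u

typedef struct PollResult {
    VkResult result;     /* return code of the query itself */
    VkBool32 unrecorded; /* nonzero if some faults could not be recorded */
    uint32_t count;      /* fault records returned by this query */
    uint32_t critical;   /* how many of them were VK_FAULT_LEVEL_CRITICAL */
} PollResult;

/* Queries and clears the pending faults of one device, then summarizes them. */
static PollResult poll_device_faults(VkDevice device,
                                     PFN_vkGetFaultData getFaultData)
{
    VkFaultData faults[MAX_POLLED_FAULTS];
    PollResult out = { VK_SUCCESS, 0, 0, 0 };
    uint32_t count = MAX_POLLED_FAULTS;

    for (uint32_t i = 0; i < MAX_POLLED_FAULTS; ++i) {
        faults[i].sType = VK_STRUCTURE_TYPE_FAULT_DATA;
        faults[i].pNext = NULL;
    }
    out.result = getFaultData(device,
                              VK_FAULT_QUERY_BEHAVIOR_GET_AND_CLEAR_ALL_FAULTS,
                              &out.unrecorded, &count, faults);
    if (out.result != VK_SUCCESS) {
        return out; /* handling of other return codes is left to the caller */
    }
    if (count > MAX_POLLED_FAULTS) {
        count = MAX_POLLED_FAULTS; /* defensive clamp */
    }
    out.count = count;
    for (uint32_t i = 0; i < count; ++i) {
        if (faults[i].faultLevel == VK_FAULT_LEVEL_CRITICAL) {
            out.critical++;
        }
    }
    return out;
}

/* Test double standing in for the driver: reports one critical fault and one
 * warning. It is not part of any real API. */
static VkResult fake_get_fault_data(VkDevice device, VkFaultQueryBehavior behavior,
                                    VkBool32 *pUnrecordedFaults,
                                    uint32_t *pFaultCount, VkFaultData *pFaults)
{
    (void)device;
    (void)behavior;
    *pUnrecordedFaults = 0;
    if (*pFaultCount < 2u) {
        *pFaultCount = 0;
        return VK_SUCCESS;
    }
    pFaults[0].faultLevel = VK_FAULT_LEVEL_CRITICAL;
    pFaults[0].faultType = VK_FAULT_TYPE_PHYSICAL_DEVICE;
    pFaults[1].faultLevel = VK_FAULT_LEVEL_WARNING;
    pFaults[1].faultType = VK_FAULT_TYPE_UNASSIGNED;
    *pFaultCount = 2;
    return VK_SUCCESS;
}
```

In an application, poll_device_faults would be called with the real vkGetFaultData entry point once per VkDevice after initialization and then periodically after queue submissions. A nonzero critical count or unrecorded flag would be a reasonable trigger for the application's degraded-mode handling.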

Recoverable Faults

In production, applications should not generate faults even at the lower-criticality VkFaultLevel values: VK_FAULT_LEVEL_RECOVERABLE, VK_FAULT_LEVEL_WARNING, and VK_FAULT_LEVEL_UNASSIGNED. This minimizes the accumulation of errors and prevents multi-point failures, in which the cause of the recoverable fault is the first point and the fault report and its handling are additional points.

This guide suggests the following application handling for different faults of VK_FAULT_LEVEL_RECOVERABLE:

When any vkCmd-prefixed function causes a fault of VK_FAULT_TYPE_COMMAND_BUFFER_FULL or VK_FAULT_TYPE_INVALID_API_USAGE during command recording, developers can expect vkEndCommandBuffer to then return an error VkResult. Applications can clear the error state of the VkCommandBuffer by calling vkResetCommandPool on its pool. Until that call to vkResetCommandPool, subsequent vkCmd-prefixed calls that record to that VkCommandBuffer are silently ignored.
When a function not prefixed with vkCmd causes a fault of VK_FAULT_TYPE_INVALID_API_USAGE, the API function skips its nominal behavior. Developers can expect that a retry with exactly the same parameters will repeat the fault report. Applications can retry that function, or skip that call and enter alternate behavior.
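The suggestions above can be condensed into a small decision function. This is a minimal sketch of one possible application policy, not behavior defined by the implementation: RecoveryAction, choose_recovery, and the during_recording flag (which the application is assumed to track itself) are hypothetical, and the two enums are local stand-ins assumed to mirror <vulkan/vulkan_sc.h> so that the code compiles without the SDK.

```c
/* ---- Stand-ins, ASSUMED to mirror <vulkan/vulkan_sc.h>. Delete this section
 * and include the real header in production code. ---- */
typedef enum VkFaultLevel {
    VK_FAULT_LEVEL_UNASSIGNED,
    VK_FAULT_LEVEL_CRITICAL,
    VK_FAULT_LEVEL_RECOVERABLE,
    VK_FAULT_LEVEL_WARNING
} VkFaultLevel;
typedef enum VkFaultType {
    VK_FAULT_TYPE_INVALID,
    VK_FAULT_TYPE_UNASSIGNED,
    VK_FAULT_TYPE_IMPLEMENTATION,
    VK_FAULT_TYPE_SYSTEM,
    VK_FAULT_TYPE_PHYSICAL_DEVICE,
    VK_FAULT_TYPE_COMMAND_BUFFER_FULL,
    VK_FAULT_TYPE_INVALID_API_USAGE
} VkFaultType;
/* ---- End of stand-ins. ---- */

/* Hypothetical application-level responses to a reported fault. */
typedef enum RecoveryAction {
    RECOVERY_LOG_ONLY,           /* record the fault and investigate later */
    RECOVERY_RESET_COMMAND_POOL, /* call vkResetCommandPool, then re-record */
    RECOVERY_USE_FALLBACK,       /* skip the call or enter alternate behavior */
    RECOVERY_TREAT_DEVICE_LOST   /* expect VK_ERROR_DEVICE_LOST on the device */
} RecoveryAction;

/* Maps a fault to a response. `during_recording` is nonzero when the
 * application was recording a command buffer when the fault was raised. */
static RecoveryAction choose_recovery(VkFaultLevel level, VkFaultType type,
                                      int during_recording)
{
    if (level == VK_FAULT_LEVEL_CRITICAL) {
        return RECOVERY_TREAT_DEVICE_LOST;
    }
    if (level != VK_FAULT_LEVEL_RECOVERABLE) {
        return RECOVERY_LOG_ONLY;
    }
    if (during_recording &&
        (type == VK_FAULT_TYPE_COMMAND_BUFFER_FULL ||
         type == VK_FAULT_TYPE_INVALID_API_USAGE)) {
        /* Later vkCmd calls are ignored until the pool is reset. */
        return RECOVERY_RESET_COMMAND_POOL;
    }
    if (!during_recording && type == VK_FAULT_TYPE_INVALID_API_USAGE) {
        /* An identical retry is expected to repeat the fault. */
        return RECOVERY_USE_FALLBACK;
    }
    return RECOVERY_LOG_ONLY;
}
```

For example, a VK_FAULT_TYPE_COMMAND_BUFFER_FULL fault raised while recording maps to RECOVERY_RESET_COMMAND_POOL, whereas a VK_FAULT_TYPE_INVALID_API_USAGE fault from a function without the vkCmd prefix maps to RECOVERY_USE_FALLBACK.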