Fault Handling
This topic sets expectations for application developers on the production use and behavior of Vulkan SC Fault Handling.
This Vulkan SC implementation expects user applications to monitor Vulkan SC Fault Handling via VkFaultCallbackInfo and vkGetFaultData, as described below.
Avoid Generating Faults
User applications must not cause Vulkan SC to report faults at VK_FAULT_LEVEL_CRITICAL during normal operation. When applications use the Vulkan SC API correctly, and the NVIDIA DRIVE OS SEooC and NVIDIA DRIVE Orin SoC operate under normal conditions, this Vulkan SC implementation is designed to report zero faults via the Fault Handling interface introduced in the Khronos Vulkan SC Specification.
The rationale is that these reports are symptoms of detected faults. This Vulkan SC implementation of fault handling reports faults only at Quality Managed (QM) availability and automotive integrity level. For VK_FAULT_LEVEL_CRITICAL in particular, the fault can indicate that the iGPU encountered an uncorrectable error.
A further reason to avoid this error condition is that the VkDevice that triggered the VK_FAULT_LEVEL_CRITICAL fault becomes lost, as described in the Vulkan SC Specification at 5.2.3. Lost Device. Vulkan SC API functions that submit to or wait on a VkQueue then return VK_ERROR_DEVICE_LOST. On Orin and the QNX Safety Platform, the system's error response causes other VkDevice objects, even those in separate system processes, to become lost, and also causes CUDA devices to return errors. This degrades the availability of iGPU execution for the system (Guest VM). An independent ASIL SW resource manager of NvGPU reports an asynchronous error to Safety Services via the Error Propagation Library.
During development and verification, if the implementation reports faults to an application, application developers should investigate and correct their usage of the Vulkan SC API. To assist in logging and debugging, the VK_NV_private_vendor_info extension supports a VkFaultDataDescriptionNV structure, which contains a description string.
Monitor SC Fault Handling
For automotive deployment, user applications must monitor the SC fault handling interface. The Vulkan SC Specification introduces a fault handling interface at 35.1. Fault Handling. The specification identifies this interface as optional for applications, but this Vulkan SC implementation expects applications to monitor Fault Handling.
Specifically, this guide expects applications to exercise fault handling as follows:
- The application must register a callback function via VkFaultCallbackInfo when creating each VkDevice. Applications provide a function of signature PFN_vkFaultCallbackFunction, assign its function pointer to VkFaultCallbackInfo::pfnFaultCallback, and "pNext-chain" VkFaultCallbackInfo to the VkDeviceCreateInfo passed to vkCreateDevice.
- The application must call vkGetFaultData for each VkDevice, after completing all its calls to functions in API Group: Init.
- After the application calls vkQueueSubmit, the application should call vkGetFaultData, within one second, on the VkDevice from which that VkQueue was received. The one-second interval is arbitrary. One call to vkGetFaultData can query any faults after multiple prior queue submissions.
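The callback described in the first list item can be sketched in C as follows. This is a minimal sketch under stated assumptions, not reference code for this implementation. So that it compiles without the Vulkan SC SDK, it declares small stand-ins for VkBool32, VkFaultLevel, VkFaultType, and VkFaultData that are intended to mirror the Khronos declarations; a real application includes <vulkan/vulkan_sc.h> instead, and that header is authoritative. AppFaultStats, g_app_fault_stats, and app_fault_callback are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch builds without the SDK; assumed to mirror the
 * Khronos header. Real code: #include <vulkan/vulkan_sc.h> and drop these. */
typedef uint32_t VkBool32;
typedef enum VkFaultLevel {
    VK_FAULT_LEVEL_UNASSIGNED  = 0,
    VK_FAULT_LEVEL_CRITICAL    = 1,
    VK_FAULT_LEVEL_RECOVERABLE = 2,
    VK_FAULT_LEVEL_WARNING     = 3
} VkFaultLevel;
typedef enum VkFaultType {
    VK_FAULT_TYPE_INVALID             = 0,
    VK_FAULT_TYPE_UNASSIGNED          = 1,
    VK_FAULT_TYPE_IMPLEMENTATION      = 2,
    VK_FAULT_TYPE_SYSTEM              = 3,
    VK_FAULT_TYPE_PHYSICAL_DEVICE     = 4,
    VK_FAULT_TYPE_COMMAND_BUFFER_FULL = 5,
    VK_FAULT_TYPE_INVALID_API_USAGE   = 6
} VkFaultType;
typedef struct VkFaultData {
    int          sType;      /* VkStructureType in the real header */
    void        *pNext;
    VkFaultLevel faultLevel;
    VkFaultType  faultType;
} VkFaultData;

/* Hypothetical application-side tally. The callback receives no user-data
 * pointer, so the sketch keeps its state in a file-scope variable. */
typedef struct AppFaultStats {
    uint32_t critical;
    uint32_t recoverable;
    uint32_t other;       /* WARNING and UNASSIGNED levels */
    VkBool32 unrecorded;  /* sticky: set once any fault went unrecorded */
} AppFaultStats;

static AppFaultStats g_app_fault_stats;

/* Same parameter list as PFN_vkFaultCallbackFunction (the real typedef also
 * carries the VKAPI_PTR calling-convention macro). It only counts faults;
 * logging and recovery are left to the rest of the application. */
static void app_fault_callback(VkBool32 unrecordedFaults,
                               uint32_t faultCount,
                               const VkFaultData *pFaults)
{
    assert(faultCount == 0 || pFaults != NULL);
    if (unrecordedFaults) {
        g_app_fault_stats.unrecorded = 1;
    }
    for (uint32_t i = 0; i < faultCount; ++i) {
        switch (pFaults[i].faultLevel) {
        case VK_FAULT_LEVEL_CRITICAL:
            ++g_app_fault_stats.critical;
            break;
        case VK_FAULT_LEVEL_RECOVERABLE:
            ++g_app_fault_stats.recoverable;
            break;
        default:
            ++g_app_fault_stats.other;
            break;
        }
    }
}
```

With the real header, the application would store app_fault_callback in VkFaultCallbackInfo::pfnFaultCallback and link that structure into the pNext chain of the VkDeviceCreateInfo passed to vkCreateDevice. The tally is not thread-safe; an application that may receive the callback on more than one thread would need atomics or a lock.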
As rationale, many functions in the Vulkan SC API return void rather than a VkResult return code. This interface design supports efficient execution on the CCPLEX, with a low branch count in application code paths that execute frequently, for example when recording command buffers. If the Vulkan SC implementation detects faults and the functions that return void do not execute normally, SC fault handling reports to the application, on a best-effort basis, that one or more faults occurred.
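Because void-returning functions can signal problems only through the fault interface, the polling rule in the list above (call vkGetFaultData within one second of vkQueueSubmit) needs some bookkeeping on the application side. The following C sketch shows one possible way to track that deadline. It assumes a single-threaded caller that supplies timestamps from a monotonic clock in nanoseconds. AppFaultPoller, the app_poller_* functions, and APP_FAULT_POLL_BUDGET_NS are hypothetical names; the one-second value comes from this guide, which describes it as arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-second interval suggested by the guide, in nanoseconds. */
#define APP_FAULT_POLL_BUDGET_NS UINT64_C(1000000000)

/* Hypothetical per-VkDevice bookkeeping. */
typedef struct AppFaultPoller {
    bool     pending;           /* a submission happened since the last poll */
    uint64_t oldest_submit_ns;  /* time of the first submission not yet polled */
} AppFaultPoller;

/* Call right after each vkQueueSubmit on a queue of this device. Only the
 * oldest unpolled submission matters, because one vkGetFaultData call covers
 * all earlier submissions. */
static void app_poller_on_submit(AppFaultPoller *p, uint64_t now_ns)
{
    assert(p != NULL);
    if (!p->pending) {
        p->pending = true;
        p->oldest_submit_ns = now_ns;
    }
}

/* Returns true when the application should call vkGetFaultData now.
 * margin_ns lets the caller poll early to absorb scheduling jitter. */
static bool app_poller_due(const AppFaultPoller *p, uint64_t now_ns,
                           uint64_t margin_ns)
{
    assert(p != NULL);
    if (!p->pending) {
        return false;
    }
    return (now_ns - p->oldest_submit_ns) + margin_ns >= APP_FAULT_POLL_BUDGET_NS;
}

/* Call after vkGetFaultData has returned for this device. */
static void app_poller_on_polled(AppFaultPoller *p)
{
    assert(p != NULL);
    p->pending = false;
}
```

A caller would check app_poller_due from its main loop, call vkGetFaultData on the VkDevice when it returns true, and then call app_poller_on_polled. Polling more often than the sketch requires is also consistent with the guidance above.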
Recoverable Faults
In production, applications should not generate faults even at the lower-criticality VkFaultLevel values. Those lower levels are VK_FAULT_LEVEL_RECOVERABLE, VK_FAULT_LEVEL_WARNING, and VK_FAULT_LEVEL_UNASSIGNED. This minimizes the accumulation of errors and prevents multi-point failures, where the cause of the recoverable fault is the first point, and the fault report and handling are additional points.
This guide suggests the following handling by applications for different faults of VK_FAULT_LEVEL_RECOVERABLE:
- When any vkCmd-prefixed function causes a fault of VK_FAULT_TYPE_COMMAND_BUFFER_FULL or VK_FAULT_TYPE_INVALID_API_USAGE during command recording, developers can expect vkEndCommandBuffer to then return an error VkResult. Applications can clear the error state of the VkCommandBuffer with vkResetCommandPool on its pool. Until that call to vkResetCommandPool, subsequent vkCmd-prefixed functions that record to that VkCommandBuffer are silently ignored.
- For functions not prefixed with vkCmd, when those functions cause a fault of VK_FAULT_TYPE_INVALID_API_USAGE, the API function skips its nominal behavior. Developers can expect that a retry with the exact same parameters will repeat the fault report. Applications can retry that function with different parameters, or skip that call and enter alternate behavior.
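The suggestions in this list can be condensed into a small decision function. The C sketch below is illustrative only and is not part of the Vulkan SC API. It uses stand-in enums that are intended to mirror the Khronos VkFaultLevel and VkFaultType declarations so that it compiles without the SDK; a real application includes <vulkan/vulkan_sc.h> instead. AppFaultAction, app_choose_fault_action, and app_fault_action_name are hypothetical names. The during_recording flag is also an assumption: the fault level and type alone do not identify the function that faulted, so the application has to track that context itself, for example from an error returned by vkEndCommandBuffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins assumed to mirror the Khronos header; real code includes
 * <vulkan/vulkan_sc.h> and drops these two enums. */
typedef enum VkFaultLevel {
    VK_FAULT_LEVEL_UNASSIGNED  = 0,
    VK_FAULT_LEVEL_CRITICAL    = 1,
    VK_FAULT_LEVEL_RECOVERABLE = 2,
    VK_FAULT_LEVEL_WARNING     = 3
} VkFaultLevel;
typedef enum VkFaultType {
    VK_FAULT_TYPE_INVALID             = 0,
    VK_FAULT_TYPE_UNASSIGNED          = 1,
    VK_FAULT_TYPE_IMPLEMENTATION      = 2,
    VK_FAULT_TYPE_SYSTEM              = 3,
    VK_FAULT_TYPE_PHYSICAL_DEVICE     = 4,
    VK_FAULT_TYPE_COMMAND_BUFFER_FULL = 5,
    VK_FAULT_TYPE_INVALID_API_USAGE   = 6
} VkFaultType;

/* Hypothetical application-level responses. */
typedef enum AppFaultAction {
    APP_ACTION_INVESTIGATE,         /* log it and fix the API usage offline */
    APP_ACTION_RESET_COMMAND_POOL,  /* vkResetCommandPool, then re-record */
    APP_ACTION_SKIP_OR_FALLBACK,    /* do not retry with identical parameters */
    APP_ACTION_HANDLE_DEVICE_LOST   /* critical fault: the VkDevice is lost */
} AppFaultAction;

/* Maps a reported fault to the response suggested by the guide. */
static AppFaultAction app_choose_fault_action(VkFaultLevel level,
                                              VkFaultType type,
                                              bool during_recording)
{
    if (level == VK_FAULT_LEVEL_CRITICAL) {
        return APP_ACTION_HANDLE_DEVICE_LOST;
    }
    if (level != VK_FAULT_LEVEL_RECOVERABLE) {
        return APP_ACTION_INVESTIGATE;
    }
    if (during_recording &&
        (type == VK_FAULT_TYPE_COMMAND_BUFFER_FULL ||
         type == VK_FAULT_TYPE_INVALID_API_USAGE)) {
        return APP_ACTION_RESET_COMMAND_POOL;
    }
    if (!during_recording && type == VK_FAULT_TYPE_INVALID_API_USAGE) {
        return APP_ACTION_SKIP_OR_FALLBACK;
    }
    return APP_ACTION_INVESTIGATE;
}

/* Short label for log messages. */
static const char *app_fault_action_name(AppFaultAction action)
{
    static const char *const names[] = {
        "investigate", "reset-command-pool",
        "skip-or-fallback", "handle-device-lost"
    };
    assert((size_t)action < sizeof names / sizeof names[0]);
    return names[action];
}
```

Keeping the policy in one pure function makes it easy to unit-test without a GPU. Recoverable faults of types that the list does not cover fall through to APP_ACTION_INVESTIGATE, which is a choice made for this sketch rather than guidance from this guide.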