Detailed Description

GPU management and access API.

For related information, see Discrete GPU in the Development Guide.

General design of nvrm_gpu

Object-oriented design

nvrm_gpu follows an object-oriented design. In general, objects are created by API functions ending with "Open" or "Create" and destroyed by API functions ending with "Close". For example, NvRmGpuLibOpen() creates the library object and NvRmGpuLibClose() destroys it.

The objects typically have a real-world counterpart or they represent a logical construct. For example, a device object represents an NVIDIA GPU.

All objects are referred to by handles. A handle is a pointer-to-struct type where the struct provides typing. For example, a library object is referred to by a variable of type NvRmGpuLib *. Handles cannot be dereferenced by the nvrm_gpu user.
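The handle scheme described above corresponds to the common C opaque-pointer idiom. A minimal sketch follows. The DemoLib and DemoDevice types and the demo_* functions are hypothetical stand-ins for illustration only and are not part of the nvrm_gpu API:

```c
#include <assert.h>
#include <stdlib.h>

/* ---- What a public header would expose (hypothetical names) ---- */
typedef struct DemoLib DemoLib;       /* incomplete type: users cannot dereference it */
typedef struct DemoDevice DemoDevice; /* distinct type: not interchangeable with DemoLib */

DemoLib *demo_lib_open(void);
DemoDevice *demo_device_open(DemoLib *lib, int index);
int demo_device_index(const DemoDevice *dev);
void demo_device_close(DemoDevice *dev);
void demo_lib_close(DemoLib *lib);

/* ---- What would stay private to the library implementation ---- */
struct DemoLib { int reserved; };
struct DemoDevice { DemoLib *lib; int index; };

DemoLib *demo_lib_open(void)
{
    return calloc(1, sizeof(DemoLib));
}

DemoDevice *demo_device_open(DemoLib *lib, int index)
{
    DemoDevice *dev = calloc(1, sizeof *dev);
    if (dev != NULL) {
        dev->lib = lib;
        dev->index = index;
    }
    return dev;
}

/* Accessors are the only way for users to read object state. */
int demo_device_index(const DemoDevice *dev)
{
    assert(dev != NULL); /* handles are assumed valid and non-NULL */
    return dev->index;
}

void demo_device_close(DemoDevice *dev) { free(dev); }
void demo_lib_close(DemoLib *lib) { free(lib); }
```

Because the struct bodies would be visible only inside the library, user code can pass handles around but cannot read or modify the objects directly, and the compiler diagnoses a DemoLib * passed where a DemoDevice * is expected.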

The handle lifetime is the same as the underlying object lifetime. That is, a valid handle is created by an Open/Create function and a Close function will invalidate the handle as well as destroy the object. For example, NvRmGpuDeviceOpen() creates a device object and returns a handle to the device. NvRmGpuDeviceClose() destroys the device object and invalidates the device handle.

In general, before destroying an object, any possible child object must be destroyed first. For example, when closing the library with NvRmGpuLibClose(), all devices opened within the library scope must be closed first with NvRmGpuDeviceClose(). Failing to do so produces undefined behavior in the form of dangling pointers in nvrm_gpu internal objects and data structures.
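The child-before-parent rule can be illustrated with a small mock. This is only a sketch: the DemoLib/DemoDevice types, the demo_* functions, and the DemoStatus codes are hypothetical and do not belong to nvrm_gpu. The mock detects the ordering violation and reports it, which is an assumption made for illustration; as stated above, violating the rule in nvrm_gpu is undefined behavior.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { DEMO_OK = 0, DEMO_OUT_OF_ORDER_FREE } DemoStatus;

typedef struct DemoLib { int liveDevices; } DemoLib;
typedef struct DemoDevice { DemoLib *lib; } DemoDevice;

DemoLib *demo_lib_open(void)
{
    return calloc(1, sizeof(DemoLib)); /* liveDevices starts at 0 */
}

DemoDevice *demo_device_open(DemoLib *lib)
{
    assert(lib != NULL);
    DemoDevice *dev = malloc(sizeof *dev);
    if (dev != NULL) {
        dev->lib = lib;     /* the child keeps a pointer to its parent */
        lib->liveDevices++; /* the parent tracks how many children exist */
    }
    return dev;
}

void demo_device_close(DemoDevice *dev)
{
    assert(dev != NULL);
    dev->lib->liveDevices--;
    free(dev);
}

/* Refuses to destroy the parent while any child is still alive. */
DemoStatus demo_lib_close(DemoLib *lib)
{
    assert(lib != NULL);
    if (lib->liveDevices != 0) {
        return DEMO_OUT_OF_ORDER_FREE;
    }
    free(lib);
    return DEMO_OK;
}
```

In this mock, closing the library while a device is open returns DEMO_OUT_OF_ORDER_FREE and leaves the library intact. Without such a check, the device would be left holding a dangling pointer to its freed parent.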

Unless otherwise stated, it is illegal to cross-reference handles under different NvRmGpuDevice instances. For instance, a channel belonging to one device object cannot be bound to the address space belonging to another device object.
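As a rough sketch of the cross-reference rule, the mock below records the owning device in each object and rejects a bind across devices. All types and the demo_channel_bind() function are hypothetical and are not nvrm_gpu API. Whether and how nvrm_gpu itself detects such misuse is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoDevice { int id; } DemoDevice;

typedef struct DemoAddressSpace {
    const DemoDevice *device; /* owning device */
} DemoAddressSpace;

typedef struct DemoChannel {
    const DemoDevice *device;   /* owning device */
    const DemoAddressSpace *as; /* bound address space, or NULL */
} DemoChannel;

/* Binds ch to as only when both belong to the same device object. */
bool demo_channel_bind(DemoChannel *ch, const DemoAddressSpace *as)
{
    assert(ch != NULL && as != NULL);
    if (ch->device != as->device) {
        return false; /* cross-device reference: rejected */
    }
    ch->as = as;
    return true;
}
```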

Common error codes

nvrm_gpu uses the following common error codes throughout the API:

Error code	Build configurations	Description	Remarks
NvError_GpuInvalidHandle	Safety builds only	Invalid API object handle	The standard (non-safety) build of nvrm_gpu does not perform handle validation, for performance reasons.
NvError_GpuInvalidDvmsState	QNX builds only	API function not available in the current DVMS state	None
NvError_GpuHwError	All	Error encountered during direct communication with the HW.	Examples: Unexpected register read value; unexpected control buffer value
NvError_GpuOutOfOrderFree	Safety builds only	A parent (or other dependency) object freed before children (or other dependencies).	None
NvError_InsufficientMemory	All	Insufficient memory to perform the operation.	None
NvError_ResourceError	All	Error communicating with a kernel-mode driver (Linux) or a resource manager (QNX)	None
NvError_GpuFatalLockdown	Safety builds only	nvrm_gpu is in lockdown mode due to a previous fatal error.	Once a fatal error has been encountered, most API calls return this error. The exceptions are API calls that never fail (e.g. NvRmGpuLibGetVersionInfo()) and API calls that always fail (e.g., calls disabled in the build). When a fatal error has been encountered, the integrity of nvrm_gpu is considered lost. The lockdown mode prevents any further undefined behavior.
NvError_GpuFatalConsistencyError	Safety builds only	Internal state consistency check failed.	Potential sources include memory corruption or a programming defect.
NvError_GpuFatalLogicError	Safety builds only	Internal logic error detected.	These errors are generally triggered by defensive mechanisms. For example: internal out-of-bounds check failed; integer overflow check failed; predicate check failed.
NvError_GpuFatalOsError	Safety builds only	An OS call or a library call that should never fail with correct programming returned an error.	Example: pthread_mutex_lock() or pthread_mutex_unlock() returned an error.
NvError_GpuFatalUncheckedException	Safety builds only	An unexpected (unchecked) exception was caught.	None
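
The lockdown behavior described for NvError_GpuFatalLockdown can be pictured as a latch that is set by the first fatal error. The sketch below is a mock under stated assumptions: the DemoStatus codes, demo_operation(), and demo_get_version() are hypothetical, and the internal fault is simulated by a parameter. It does not reflect how nvrm_gpu implements lockdown internally.

```c
#include <stdbool.h>

typedef enum {
    DEMO_OK = 0,
    DEMO_FATAL_CONSISTENCY_ERROR,
    DEMO_FATAL_LOCKDOWN
} DemoStatus;

/* Latch that is set by the first fatal error and never cleared.
 * A real multi-threaded library would need an atomic or a lock here. */
static bool g_lockdown = false;

/* A call that never fails: it is not affected by lockdown. */
int demo_get_version(void)
{
    return 1;
}

/* An ordinary call. simulateCorruption stands in for a failed internal check. */
DemoStatus demo_operation(bool simulateCorruption)
{
    if (g_lockdown) {
        return DEMO_FATAL_LOCKDOWN; /* integrity already lost: refuse to run */
    }
    if (simulateCorruption) {
        g_lockdown = true;          /* first fatal error engages the latch */
        return DEMO_FATAL_CONSISTENCY_ERROR;
    }
    return DEMO_OK;
}
```

The first fatal error is reported with its specific code. Every later call to demo_operation() reports the lockdown code instead, while demo_get_version() keeps working.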

General interface contract

Parameter sanity checking:

nvrm_gpu performs general sanity checking on all received parameters, with the exception of pointer validity. The sanity checking includes basic range checking for integer values, validity checking for enumeration values, and so on. Sanity checking may be delegated to the underlying kernel driver, other shared libraries used by nvrm_gpu, or the GPU itself.
Pointers passed as parameters are not validated. nvrm_gpu is a userspace component and has no efficient way of validating pointers. This also applies to the nvrm_gpu object handles. All non-NULL pointers are expected to be valid.
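
The kind of checking described above might look like the following sketch. The DemoPriority enumeration, the timeslice limits, and demo_config_set() are hypothetical and chosen only for illustration. They are not nvrm_gpu parameters.

```c
#include <stdbool.h>

typedef enum {
    DEMO_PRIORITY_LOW = 0,
    DEMO_PRIORITY_NORMAL,
    DEMO_PRIORITY_HIGH
} DemoPriority;

typedef struct DemoConfig {
    DemoPriority priority;
    unsigned timesliceUs;
} DemoConfig;

#define DEMO_TIMESLICE_MIN_US 100u
#define DEMO_TIMESLICE_MAX_US 10000u

/* Returns true and updates *cfg only if both values pass the checks.
 * cfg itself is not validated: a non-NULL pointer is assumed to be valid. */
bool demo_config_set(DemoConfig *cfg, DemoPriority priority, unsigned timesliceUs)
{
    /* Enumeration validity check. */
    if ((int)priority < (int)DEMO_PRIORITY_LOW || (int)priority > (int)DEMO_PRIORITY_HIGH) {
        return false;
    }
    /* Basic integer range check. */
    if (timesliceUs < DEMO_TIMESLICE_MIN_US || timesliceUs > DEMO_TIMESLICE_MAX_US) {
        return false;
    }
    cfg->priority = priority;
    cfg->timesliceUs = timesliceUs;
    return true;
}
```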

Objects and their relations:

In object creation, the first parameter specifies the parent. Some objects may be created for parents of different types. When this is the case, it is specifically stated in the API.
It is a fatal error to close a parent or otherwise related object before the child object, unless otherwise stated. The parent and other related objects are either referenced in the Open/Create functions or specific other functions that bind objects together.
The Close functions can always be supplied with NULL handles, unless otherwise stated. Closing a NULL handle is a no-operation. This allows the nvrm_gpu users to unconditionally close all handles in deinitialization, regardless of whether they refer to valid objects or NULL. (See NvRmGpuLibClose() for an example.)
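
The NULL-tolerant Close convention enables a single cleanup path. A minimal sketch, using hypothetical DemoLib/DemoDevice mocks rather than the nvrm_gpu API, with a failure flag added purely to exercise the error path:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct DemoLib { int reserved; } DemoLib;
typedef struct DemoDevice { DemoLib *lib; } DemoDevice;

static int g_liveObjects = 0; /* bookkeeping so leaks are observable */

DemoLib *demo_lib_open(void)
{
    DemoLib *lib = calloc(1, sizeof *lib);
    if (lib != NULL) {
        g_liveObjects++;
    }
    return lib;
}

DemoDevice *demo_device_open(DemoLib *lib, bool simulateFailure)
{
    DemoDevice *dev = simulateFailure ? NULL : calloc(1, sizeof *dev);
    if (dev != NULL) {
        dev->lib = lib;
        g_liveObjects++;
    }
    return dev;
}

/* Closing a NULL handle is a no-operation. */
void demo_device_close(DemoDevice *dev)
{
    if (dev == NULL) {
        return;
    }
    free(dev);
    g_liveObjects--;
}

void demo_lib_close(DemoLib *lib)
{
    if (lib == NULL) {
        return;
    }
    free(lib);
    g_liveObjects--;
}

int demo_live_objects(void) { return g_liveObjects; }

/* Opens a library and a device, then tears everything down again. */
bool demo_run(bool failDeviceOpen)
{
    bool ok = false;
    DemoLib *lib = NULL;
    DemoDevice *dev = NULL;

    lib = demo_lib_open();
    if (lib == NULL) {
        goto cleanup;
    }
    dev = demo_device_open(lib, failDeviceOpen);
    if (dev == NULL) {
        goto cleanup;
    }
    ok = true;

cleanup:
    /* Unconditional, child first: NULL handles are simply skipped. */
    demo_device_close(dev);
    demo_lib_close(lib);
    return ok;
}
```

Initializing every handle to NULL and closing in reverse order of creation keeps the cleanup block identical for the success path and every failure path.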

Handles:

nvrm_gpu generally assumes that all nvrm_gpu handles passed to the nvrm_gpu API functions are valid, correctly typed, and non-NULL. Individual API functions may relax the non-NULL requirement by marking input handles as optional in the API function documentation, in which case a NULL pointer may be passed instead.
Generally, a valid handle is obtained from an API function whose name ends with "Create" or "Open". A valid handle is invalidated by calling the corresponding API function whose name ends with "Close".

Pointers:

All passed pointers to memory are expected to be pointers to the regular memory type. On AArch64, this is the Normal memory type. In particular, Device memory types (e.g., Device-GRE) are excluded. All dereferenceable pointers received from nvrm_gpu are regular memory types unless otherwise stated.
All passed pointers are expected to be aligned by the C platform ABI rules.
Some nvrm_gpu API functions return const pointers as return values. In general, the pointers remain valid as long as the related object is valid, unless otherwise stated. In particular, the caller must not attempt to free them. For example, the device list returned by NvRmGpuLibListDevices() is valid until the related library handle is closed.
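
The ownership rule for returned const pointers can be sketched with a mock in which the list lives inside the parent object. DemoLib, DemoDeviceEntry, and the demo_* functions are hypothetical, as are the entry fields and device names. They do not describe the layout returned by NvRmGpuLibListDevices().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct DemoDeviceEntry {
    int deviceIndex;
    const char *name;
} DemoDeviceEntry;

typedef struct DemoLib {
    DemoDeviceEntry devices[2]; /* storage owned by the library object */
    size_t numDevices;
} DemoLib;

DemoLib *demo_lib_open(void)
{
    DemoLib *lib = calloc(1, sizeof *lib);
    if (lib != NULL) {
        lib->devices[0] = (DemoDeviceEntry){ 0, "demo-gpu0" };
        lib->devices[1] = (DemoDeviceEntry){ 1, "demo-gpu1" };
        lib->numDevices = 2;
    }
    return lib;
}

/* Returns a pointer into lib's own storage. The caller must not free it,
 * and it becomes invalid as soon as lib is closed. */
const DemoDeviceEntry *demo_lib_list_devices(const DemoLib *lib, size_t *numDevices)
{
    assert(lib != NULL && numDevices != NULL);
    *numDevices = lib->numDevices;
    return lib->devices;
}

void demo_lib_close(DemoLib *lib)
{
    free(lib); /* also releases the device list */
}
```

A caller that needs the data after closing the library has to copy the entries first.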

Forwards and backwards compatibility:

Many nvrm_gpu functions expect a pointer to an attribute struct. The attribute struct provides room for future extensibility while maintaining source-level backwards compatibility. Generally, a NULL pointer is also accepted in which case nvrm_gpu assumes default attributes. The nvrm_gpu users should define the attribute structs with macros provided by the nvrm_gpu API. (See NvRmGpuDeviceOpen() and NVRM_GPU_DEFINE_DEVICE_OPEN_ATTR() for instance.)
In general, nvrm_gpu attempts to maintain source-level backwards compatibility. This means that nvrm_gpu can be upgraded without changes to the sources of its users. However, a recompilation may be needed, as nvrm_gpu does not provide binary compatibility.
Backwards compatibility is usually broken only for security reasons or because the functionality has been completely removed.
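
The attribute-struct pattern described above can be sketched as follows. DemoOpenAttr, its fields, the default values, and the DEMO_DEFINE_OPEN_ATTR() macro are hypothetical and only mimic the shape of macros such as NVRM_GPU_DEFINE_DEVICE_OPEN_ATTR(). The actual nvrm_gpu attribute fields are not shown here.

```c
#include <stddef.h>

/* Hypothetical attribute struct. A later library version could append
 * fields without breaking sources that use the definition macro. */
typedef struct DemoOpenAttr {
    int priority;
    int timeoutMs;
} DemoOpenAttr;

#define DEMO_OPEN_ATTR_DEFAULTS { .priority = 0, .timeoutMs = 1000 }

/* Defines a variable with every field set to its default value. */
#define DEMO_DEFINE_OPEN_ATTR(name) DemoOpenAttr name = DEMO_OPEN_ATTR_DEFAULTS

/* Returns the timeout that an open call would use.
 * A NULL attribute pointer selects the defaults. */
int demo_effective_timeout(const DemoOpenAttr *attr)
{
    static const DemoOpenAttr defaults = DEMO_OPEN_ATTR_DEFAULTS;
    const DemoOpenAttr *a = (attr != NULL) ? attr : &defaults;
    return a->timeoutMs;
}
```

A user would write DEMO_DEFINE_OPEN_ATTR(attr); and then override only the fields of interest, for example attr.timeoutMs = 250;. Since the macro initializes every field, a field added in a newer library version receives its default after a plain recompilation, which is the source-level compatibility described above.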

Concurrency and thread safety

nvrm_gpu is thread-safe with the following rule: an object may be closed only if there are no concurrent operations ongoing on it. Attempting to close an object in one thread while another thread is still accessing it is a fatal error.
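
One way for a user to satisfy the close rule is to join every thread that uses an object before closing it. The sketch below uses POSIX threads and a hypothetical DemoDevice mock with its own mutex. None of the demo_* names are nvrm_gpu API, and the worker count and iteration count are arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct DemoDevice {
    pthread_mutex_t lock; /* makes individual operations thread-safe */
    int submissions;
} DemoDevice;

DemoDevice *demo_device_open(void)
{
    DemoDevice *dev = calloc(1, sizeof *dev);
    if (dev != NULL && pthread_mutex_init(&dev->lock, NULL) != 0) {
        free(dev);
        dev = NULL;
    }
    return dev;
}

void demo_device_submit(DemoDevice *dev)
{
    assert(dev != NULL);
    pthread_mutex_lock(&dev->lock);
    dev->submissions++;
    pthread_mutex_unlock(&dev->lock);
}

/* Must only be called when no other thread is using dev. */
void demo_device_close(DemoDevice *dev)
{
    if (dev == NULL) {
        return;
    }
    pthread_mutex_destroy(&dev->lock);
    free(dev);
}

static void *demo_worker(void *arg)
{
    DemoDevice *dev = arg;
    for (int i = 0; i < 1000; i++) {
        demo_device_submit(dev);
    }
    return NULL;
}

/* Returns the total number of submissions, or -1 on failure. */
int demo_run_workers(void)
{
    enum { NUM_WORKERS = 4 };
    pthread_t workers[NUM_WORKERS];
    int started = 0;
    int total = -1;

    DemoDevice *dev = demo_device_open();
    if (dev == NULL) {
        return -1;
    }
    for (; started < NUM_WORKERS; started++) {
        if (pthread_create(&workers[started], NULL, demo_worker, dev) != 0) {
            break;
        }
    }
    /* Join every started worker BEFORE closing the shared object. */
    for (int i = 0; i < started; i++) {
        pthread_join(workers[i], NULL);
    }
    if (started == NUM_WORKERS) {
        total = dev->submissions;
    }
    demo_device_close(dev);
    return total;
}
```

The concurrent demo_device_submit() calls are safe because of the per-object lock. Closing is safe only because pthread_join() guarantees that no worker is still inside a call on the object.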

nvrm_gpu internally uses fine-grained locking to promote high-performance multi-threaded programming. The nvrm_gpu implementation uses the partial lock ordering technique to avoid deadlocks related to nested locking.

Safety-certified subset

nvrm_gpu is subject to safety certification for specific releases. The nvrm_gpu library in a safety-certified release comes in the form of a special build called the safety build of nvrm_gpu. The safety build supports a subset of the nvrm_gpu API functionality. This subset is referred to as the safety subset.

The safety subset available in the safety build is denoted by API groups that have "safety subset" in their name. Functionality that is not within the "safety subset" groups is not available in the safety build and must not be used in a safety-critical context. The top-level safety subset API group is GPU access API (safety subset).

The use of nvrm_gpu in a safety-critical context is further subject to certain assumptions, restrictions, and recommendations. These are described in the safety manual provided in the safety-certified release.

The safety build of nvrm_gpu enables additional internal checks that are omitted from the regular builds for performance reasons.

Modules
	GPU access API: Device management
	Device control, device capabilities, and device memory management.

	GPU access API: Library
	Library management and device discovery.

NVIDIA DRIVE OS Linux SDK API Reference

6.0.9 Release