The application requirements are captured by a compute graph, where each node in the compute graph specifies an atomic task on a single hardware engine. This section walks you through the graph specification as captured by our YAML schema. Note that YAML treats everything as a list or a dictionary. The ordering of fields does not matter; the compiler accepts fields in any order, as long as the expected nesting hierarchy is satisfied. Note also that IDs cannot contain a period (.) symbol, as that symbol is used internally by the Framework.
This field specifies the input specification version number, which is required to ensure that incompatible features are disabled or flagged. Ideally, it should match the version of the provided compiler package. This field is mandatory.
Version: 2.0.0 # Input specification version - currently 2.0.0
The graph ID is the second top-level entry in the YAML (first being the Version). The graph definition is nested under the graph ID.
Version: 2.0.0 # Input specification version - currently 2.0.0
SimpleGraph:
    <..Graph Description..>
Resources that are used system-wide are modeled under the global resources section. Hardware resources like CPUs or GPUs should go under this section, and any system-wide virtual scheduling mutexes can also be listed here. The compiler models each resource as a timeline on which only one runnable can execute at any time. A runnable can, however, use more than one resource; there are some limitations on the types of resources that can be simultaneously used by a runnable, which are covered in this section. Generally, resources are specified in the following format. Certain types of resources have additional features, which are described in their respective sections. Global resources are nested under a Resources section under the graph ID. To define a resource, a resource type needs to be specified, and resource instances are grouped under the appropriate resource type. YAML supports two ways of specifying lists, both of which are shown in the example below.
SimpleGraph:
    Resources:
        Resource_Type0: [Rsrc_Type0_Instance0, Rsrc_Type0_InstanceN]
        Resource_Type1:
            - Rsrc_Type1_Instance0
            - Rsrc_Type1_Instance1
The CPU, GPU, and VPU resource types are known to the compiler, which takes specialized scheduling steps for runnables scheduled on those resources. Other resource types are treated as scheduling mutexes and carry no naming restrictions.
To specify CPUs in the system, the resource type should be set to CPU and the resource instances should be named as CPUX, where X is a valid CPU number.
SimpleGraph:
    Resources:
        CPU: [CPU0, CPU1, CPU2]
Graphics Processing Units (GPUs) and Vector Processing Units (VPUs) are special hardware accelerators that can be used to offload computation from the CPUs. Work is submitted to these engines through CUDA Streams and PVA Streams, respectively. Supported device IDs are GPUX and VPUX, where X is a valid instance number. When specifying the device instances, an optional limit can be set to bound the number of streams/queues mapped to that device instance. To specify a Stream/Queue limit on an instance, append ": Y" to the instance ID, where Y is the limit. In the following example, instance GPU0 allows unlimited CUDA Streams, whereas GPU1 allows only 8 Streams.
SimpleGraph:
    Resources:
        GPU:
            - GPU0    # Unlimited Streams
            - GPU1: 8 # 8 Streams
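VPUs are declared analogously. The following sketch is our illustration, assuming a VPU0 device instance exists on the platform and that the same instance/limit syntax applies:

SimpleGraph:
    Resources:
        VPU:
            - VPU0: 1 # At most 1 PVA Stream mapped to VPU0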
Any resource type not known by the compiler is modeled as a scheduling mutex. There are no naming conventions associated with either the resource type or the resource ID for a scheduling mutex. Interfering runnables can specify a scheduling mutex as a resource requirement to prevent the compiler from scheduling them concurrently.
SimpleGraph:
    Resources:
        # Can be used to mutually exclude memory-intensive tasks
        MEMORY_BUS: [MEMORY_BUS0]
        # Scheduling mutexes
        MUTEX: [SCHED_MUTEX0, SCHED_MUTEX1]
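For example, two memory-intensive runnables could each claim MEMORY_BUS0 in their resource requirements. The fragment below is a hypothetical excerpt of the runnable specification covered later in this section; the runnable IDs are illustrative:

Runnables:
    - CopyBuffersA:
        Resources: [CPU, MEMORY_BUS0] # Both runnables claim MEMORY_BUS0,
    - CopyBuffersB:                   # so the compiler never schedules
        Resources: [CPU, MEMORY_BUS0] # them concurrently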
A hyperepoch is a resource partition that runs a fixed configuration of epochs, which share the resources in that partition. It is periodic in nature and respawns the contained epochs at the specified period. The relationship between a hyperepoch and its member epochs is covered in the Epochs section. To define a hyperepoch, the required fields are Resources, Period, and Epochs; in certain configurations, some fields can be omitted, as specified in the respective sections. Hyperepochs are specified in a list under the ‘Hyperepochs’ keyword inside the graph specification, as shown below. Hyperepoch0 is the ID of the hyperepoch defined in the following graph.
SimpleGraph:
    Hyperepochs:
        - Hyperepoch0: # Hyperepoch ID
            Period: 100ms
            Resources:
                - MEMORY_BUS0
                - GPU0
                - CPU0
                - CPU1
The period for a hyperepoch specifies the rate at which the contained epochs are spawned. This field can be omitted if the hyperepoch has only one epoch, in which case the periodicity of the hyperepoch is equal to that of the contained epoch. The Period field is nested under the hyperepoch's ID.
Each hyperepoch is associated with a mutually exclusive set of resources. Resources are mapped to hyperepochs by specifying the resource IDs in a list under the Resources heading inside the hyperepoch specification, as shown in the example above. There, the resources MEMORY_BUS0, GPU0, CPU0, and CPU1 are mapped to the hyperepoch Hyperepoch0. If there is only one hyperepoch in the system, this resource mapping can be omitted, and the hyperepoch is assumed to have access to all the resources in the system.
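Combining the two omission rules above, a system with a single hyperepoch containing a single epoch could be specified as compactly as the following sketch (the IDs here are illustrative):

SimpleGraph:
    Hyperepochs:
        - SoleHyperepoch:    # Only hyperepoch: Resources omitted,
            Epochs:          # implicitly owns all global resources
                - SoleEpoch:     # Only epoch: hyperepoch Period
                    Period: 10ms # inferred from this epoch's period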
Epochs are the time bases at which their constituent runnables are spawned, confined to the boundaries of the hyperepoch. Each epoch is a member of a hyperepoch and has two attributes associated with it: Period and Frames. To specify epochs, list epoch IDs under the Epochs heading in a hyperepoch, as shown below.
Hyperepochs:
    - Hyperepoch0:
        Period: 100ms
        Epochs:
            - Epoch0:
                Period: 10ms
                Frames: 8
            - Epoch1:
                Period: 100ms
            - Epoch2:
                Period: 33ms
                Frames: 3
The epoch's period specifies the rate at which a frame of runnables is spawned, up to the specified number of frames, within the hyperepoch's period. If not specified, the number of frames defaults to 1. In the example above, Epoch0 spawns 8 frames in 100ms, each frame 10ms apart. Epoch1 spawns once per hyperepoch, as the number of frames defaults to 1. Epoch2 spawns three times, 33ms apart. If periodicity is not required at the epoch level, the period can be omitted; the number of frames then specifies the number of times the epoch's set of runnables is spawned. This can also be used to determine how many frames fit inside the hyperepoch's period.
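As an illustrative sketch (the IDs here are hypothetical), an epoch without a period could be specified as:

Hyperepochs:
    - Hyperepoch0:
        Period: 100ms
        Epochs:
            - BatchEpoch: # Period omitted: not periodic
                Frames: 4 # Spawn the runnable set 4 times within 100ms

The following example shows how a system can use hyperepochs to define different frequency domains.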
Version: 2.0.0
Drive: # Graph ID
    Resources: # Global Resources
        CPU: [CPU0, CPU1, CPU2]
        GPU: [GPU0]
    Hyperepochs:
        - Perception: # Hyperepoch ID
            Period: 100ms # Hyperepoch period
            Resources: [CPU1, CPU2, GPU0] # Resource mapping
            Epochs:
                - Camera: # Epoch ID
                    Period: 33.33ms
                    Frames: 3
                - Radar: # Epoch ID
                    Period: 100ms
                    Frames: 1
        - Control: # Hyperepoch ID; Hyperepoch
            Resources: [CPU0] # period inferred from epoch.
            Epochs:
                - VDC: # Epoch ID
                    Period: 10ms # Epoch frames = 1 (default)
This configuration is visualized in the following figure. Note that Camera and Radar frames are synchronized with each other at the hyperepoch boundary; VDC frames are not aligned with either the Camera or Radar frames, as they are in a separate hyperepoch with a different time base.
Hyperepochs and epochs define the timing boundaries for tasks (runnables); clients define the data boundaries. A client is an operating system process that contains software resources (like CUDA streams) and runnables. Clients are specified in the graph specification section under the Clients header. Each client specifies its contained software resources (if any), lists the epochs present in that client, and associates runnables with each epoch. A typical client is specified as follows:
Version: 2.0.0
Drive: # Graph ID
    Clients:
        - Client0: # Client ID
            Resources: # Client0's internal resources
                # Resource Definition
            Epochs: # Epochs present in this client
                - Perception.Camera: # Epoch Global ID - <HyperepochID.EpochID>
                    Runnables: # Runnables present in Perception.Camera
                        - ReadCamera: # Runnable ID (Unique inside a client)
                            # Runnable specification...
                        - RunnableN:
                            # Runnable specification...
                - Perception.Radar:
                    Runnables: # Runnables present in Perception.Radar
                        - ProcessRadar:
                            # Runnable specification...
Clients can specify resources that are visible only to runnables local to that client; these resources cannot be accessed by runnables in other clients. Global resources, in contrast, are visible to all runnables. Process-specific resources like CUDA streams and PVA streams are examples of client resources that cannot be shared across different clients. Internal scheduling mutexes can also be modeled here. These resources are specified in a format like that of Global Resources.
Clients:
    - Client1:
        Resources:
            ResourceType0:
                - ResourceType0_Instance0
                - ResourceType0_InstanceN
            ResourceTypeN:
                - ResourceTypeN_Instance0
                - ResourceTypeN_InstanceN
CUDA streams and PVA streams are client-specific software resources that are mapped to corresponding hardware engines (GPU and VPU, respectively). To specify these resources, the resource type should be set to CUDA_STREAM or PVA_STREAM, respectively. The hardware engine mapping is conveyed to the compiler when specifying the resource instances, as shown in the example below. The mapped hardware resource instances must be declared under the corresponding hardware resource type in the Global Resources section. Note that the compiler will throw an error if the limits on the mapped resource (as specified in the GPU/VPU Global Resources section) are violated.
Clients:
    - Client0:
        Resources:
            CUDA_STREAM:
                - CUDA_STREAM0: GPU0 # CUDA_STREAM0 mapped to GPU0
                - CUDA_STREAM1: GPU0 # CUDA_STREAM1 mapped to GPU0
            PVA_STREAM: # A client can have one unique stream per VPU
                - PVA_STREAM0: VPU0 # PVA_STREAM0 mapped to VPU0
Resource types other than those specified in the Global Resources section above are treated as local scheduling mutexes. These cannot be mapped to a hardware resource.
Clients:
    - Client0:
        Resources:
            LOCAL_SCHED_MUTEX:
                - LOCAL_SCHED_MUTEX0
            LOCAL_RESOURCE_MUTEX:
                - RESOURCE_MUTEX0
Here's a visual representation of the data, timing, and resource boundaries in STM:
Tasks in the system executed on hardware engines are known as runnables. Ideally, each runnable should use only a single hardware engine; synchronous runnables (runnables that use multiple hardware engines simultaneously) are currently not supported. Runnables require resources for execution and can be dependent on other runnables. Care must be taken to ensure that dependencies do not introduce a cycle in the graph. Depending on the type of engine used, the compiler classifies each runnable into one of the three following classes: normal (CPU) runnables, submitter runnables (which run on a CPU and submit work to an accelerator), and submittee runnables (the submitted work that executes on the accelerator itself).
For each runnable, parameters such as WCET (worst-case execution time), StartTime, Deadline, Resources, Dependencies, and Submits can be set, as shown in the example below:
Clients:
    - Client0:
        Resources:
            CUDA_STREAM:
                - CUDA_STREAM0: GPU0
        Epochs:
            - Perception.Camera: # Camera epoch in Perception Hyperepoch
                Runnables:
                    - ReadCamera: # Normal runnable
                        WCET: 10us
                        Resources:
                            - CPU # This runnable runs on a CPU
                    - PreProcessImage: # Submitter runnable
                        WCET: 20ms
                        StartTime: 1ms # Starts 1ms after the camera epoch
                        Resources: # GPU Submitter needs CPU0 and a stream
                            - CPU0
                            - CUDA_STREAM
                        Dependencies: [Client0.ReadCamera] # Depends on ReadCamera
                        Submits: Client0.PreProcessGPUWork # Mentions submittee
                    - PreProcessGPUWork: # Submittee runnable
                        WCET: 5000ns
                        Deadline: 30ms # Hint to schedule this before 30ms
                        Resources: [GPU]
                        Dependencies:
                            - Client0.PreProcessImage # Optional for submittees
                        # Note: Inter-epoch dependencies are currently not
                        # supported. Inter-client dependencies are supported.
            - Perception.Radar: # Radar epoch in Perception Hyperepoch
                Runnables:
                    - ProcessRadar:
                        # Runnable specification...
The runnables inside an epoch can be round-robinned, such that different runnables execute in the same slot in different frames.
Version: 2.0.0
Drive: # Graph ID
    Resources: # Global Resources
        CPU: [CPU0, CPU1, CPU2]
        GPU: [GPU0]
    Hyperepochs:
        - Perception: # Hyperepoch ID
            Period: 100ms # Hyperepoch period
            Resources: [CPU1, CPU2, GPU0] # Resource mapping
            Epochs:
                - Camera: # Epoch ID
                    AliasGroups:
                        - parent_round_robin:
                            Steps: [client0.n2, client0.n4]
                        - child_round_robin:
                            Steps: [client0.n3, client0.n5]
                    Period: 33.33ms
                    Frames: 3
                - Radar: # Epoch ID
                    Period: 100ms
                    Frames: 1
        - Control: # Hyperepoch ID; Hyperepoch
            Resources: [CPU0] # period inferred from epoch.
            Epochs:
                - VDC: # Epoch ID
                    Period: 10ms # Epoch frames = 1 (default)
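To make the round-robin behavior concrete, the following sketch (our illustration, assuming frame indices start at 0 and that each alias group cycles through its Steps list one step per frame) shows which steps the Camera epoch's three frames would execute:

# Camera epoch: Frames: 3 per 100ms hyperepoch period
# Frame 0 (even): parent_round_robin -> client0.n2, child_round_robin -> client0.n3
# Frame 1 (odd):  parent_round_robin -> client0.n4, child_round_robin -> client0.n5
# Frame 2 (even): client0.n2 and client0.n3 execute again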
In the above example, n2 and n3 will execute in even frames, while n4 and n5 will execute in odd frames. The following conditions must hold for steps:
This section covered the skeleton for timing specification in the graph. In the following sections, we will cover the specification of tasks that adhere to these timing specs.