The application requirements are captured by a compute graph, where each node specifies an atomic task on a single hardware engine. This section walks you through the graph specification as captured by our YAML schema. Note that YAML treats everything as a list or a dictionary. Field ordering does not matter; the compiler accepts the fields in any order as long as the expected nesting hierarchy is satisfied. Note that IDs cannot contain a period (.) symbol, as that symbol is used internally by the Framework.
This field specifies the input specification version number. This is required to ensure that incompatible features are disabled/flagged. Ideally it should match the version of the provided compiler package. This field is mandatory.
Version: 3.0.0 # Input specification version - currently 3.0.0
The graph ID is the second top-level entry in the YAML (first being the Version). The graph definition is nested under the graph ID.
Version: 3.0.0 # Input specification version - currently 3.0.0
SimpleGraph:
  <..Graph Description..>
The identifier is a numeric ID for the graph that is used for tracking schedules across schedule switches. It is a required parameter under the graph ID.
SimpleGraph:
  Identifier: 101 # A unique integer ID specified by the user
Resources that are used system-wide are modeled under the global resources section. Hardware resources like CPUs, GPUs, or PVAs should go under this section. Any system-wide virtual scheduling mutexes can also be listed here. The compiler models each resource as a timeline on which only one runnable can execute at any time. A runnable can, however, use more than one resource. There are some limitations on the types of resources which can be simultaneously used by a runnable, which are covered in this section. Generally, resources are specified in the following format. Certain types of resources have additional features which are described in the respective sections. Global resources are nested under a Resources section under the graph ID. To define a resource, a resource type needs to be specified. Resource instances are grouped under the appropriate resource type. YAML supports two ways of specifying lists, which are shown in the example below.
SimpleGraph:
  Resources:
    Resource_Type0: [Rsrc_Type0_Instance0, Rsrc_Type0_InstanceN]
    Resource_Type1:
      - Rsrc_Type1_Instance0
      - Rsrc_Type1_Instance1
Hardware resources specified in the CPU Resource Type and Hardware Accelerator Resource Types sections are known resource types for the compiler and require specialized steps to schedule runnables on these resources. Other resource types are considered as scheduling mutexes, and they do not have any naming restrictions.
To specify CPUs in the system, the resource type should be set to CPU and the resource instances should be named as CPUX, where X is a valid CPU number.
SimpleGraph:
  Resources:
    CPU: [CPU0, CPU1, CPU2]
The following hardware accelerators can be used to offload computation from the CPUs: GPUs and PVAs (VPUs). Supported device IDs are GPUX and VPUX, where X is a valid instance number. When specifying the device instances, an optional limit can be given to restrict the number of streams/queues mapped to that device instance. To specify a stream/queue limit on an instance, append ": Y" to the instance ID, where Y is the limit. In the following example, instance GPU0 allows unlimited CUDA Streams, whereas GPU1 allows only 8 Streams.
SimpleGraph:
  Resources:
    GPU:
      - GPU0    # Unlimited Streams
      - GPU1: 8 # 8 Streams
Any resource type not known by the compiler is modeled as a scheduling mutex. There are no naming conventions associated with either the resource type or the resource ID for a scheduling mutex. Interfering runnables can specify a scheduling mutex as a resource requirement to prevent the compiler from scheduling them concurrently.
SimpleGraph:
  Resources:
    # Can be used to mutually exclude memory-intensive tasks
    MEMORY_BUS: [MEMORY_BUS0]
    # Scheduling mutexes
    MUTEX: [SCHED_MUTEX0, SCHED_MUTEX1]
A hyperepoch is a resource partition that runs a fixed configuration of epochs that share the resources in that partition. It is periodic in nature, and it respawns the contained epochs at the specified period. This relationship between the hyperepoch and its member epochs will be covered in the Epochs section. To define a hyperepoch, the required fields are Resources, Period and Epochs. In certain configurations, some fields can be omitted as specified in the respective sections. Hyperepochs are specified in a list under the ‘Hyperepochs’ keyword inside the Graph specification as shown below. Hyperepoch0 is the ID of the hyperepoch that is defined in the following graph.
SimpleGraph:
  Hyperepochs:
    - Hyperepoch0:      # Hyperepoch ID
        Period: 100ms
        Resources:
          - MEMORY_BUS0
          - GPU0
          - CPU0
          - CPU1
The period of a hyperepoch specifies the rate at which the contained epochs are spawned. This field can be omitted if the hyperepoch has only one epoch; in that case the periodicity of the hyperepoch is equal to that of the contained epoch. The Period field is nested under the hyperepoch's ID.
Each hyperepoch is associated with a mutually exclusive set of resources. Resources are mapped to hyperepochs by specifying the resource IDs in a list under the Resources heading inside the hyperepoch specification, as shown in the example above. There, the resources MEMORY_BUS0, GPU0, CPU0, and CPU1 are mapped to the hyperepoch Hyperepoch0. If there is only one hyperepoch in the system, this resource mapping can be omitted and the hyperepoch is assumed to have access to all the resources in the system.
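As a minimal sketch of both omission rules above (the graph, hyperepoch, and epoch IDs here are hypothetical), the graph below contains a single hyperepoch with a single epoch; the hyperepoch's Resources mapping is omitted, so it owns all global resources, and its Period is omitted, so it is inferred from the contained epoch:

Version: 3.0.0
MinimalGraph:            # Hypothetical graph ID
  Identifier: 7
  Resources:
    CPU: [CPU0, CPU1]
  Hyperepochs:
    - SoloHyperepoch:    # Only hyperepoch in the system:
                         # Resources omitted -> gets all global resources;
                         # Period omitted -> inferred from the single epoch below.
        Epochs:
          - SoloEpoch:
              Period: 50ms # Hyperepoch period becomes 50ms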
Epochs are time bases that determine the rate at which their constituent runnables spawn, confined to the boundaries of the hyperepoch. Each epoch is a member of a hyperepoch and has two attributes associated with it: Period and Frames. To specify epochs, list epoch IDs under the Epochs heading in a hyperepoch as shown below.
Hyperepochs:
  - Hyperepoch0:
      Period: 100ms
      Epochs:
        - Epoch0:
            Period: 10ms
            Frames: 8
        - Epoch1:
            Period: 100ms
        - Epoch2:
            Period: 33ms
            Frames: 3
The epoch's Period specifies the rate at which a frame of runnables is spawned, up to the number of frames specified, within the hyperepoch's period. If not specified, the number of frames defaults to 1. In the example above, Epoch0 spawns 8 frames in 100ms, with each frame spawned 10ms apart. Epoch1 spawns once per hyperepoch, as the number of frames defaults to 1. Epoch2 spawns three times, 33ms apart. If periodicity is not required at the epoch level, the Period can be omitted; the Frames value then specifies how many times the epoch's set of runnables is spawned, and those frames are fit inside the hyperepoch's period. The following example shows how a system can use hyperepochs to define different frequency domains.
Version: 3.0.0
Drive:                                # Graph ID
  Resources:                          # Global Resources
    CPU: [CPU0, CPU1, CPU2]
    GPU: [GPU0]
  Hyperepochs:
    - Perception:                     # Hyperepoch ID
        Period: 100ms                 # Hyperepoch period
        Resources: [CPU1, CPU2, GPU0] # Resource mapping
        Epochs:
          - Camera:                   # Epoch ID
              Period: 33.33ms
              Frames: 3
          - Radar:                    # Epoch ID
              Period: 100ms
              Frames: 1
    - Control:                        # Hyperepoch ID; Hyperepoch
        Resources: [CPU0]             # period inferred from epoch.
        Epochs:
          - VDC:                      # Epoch ID
              Period: 10ms            # Epoch frames = 1 (default)
This configuration is visualized in the following figure. Note that Camera and Radar frames are synchronized with each other at the hyperepoch boundary; VDC frames are not aligned with either the Camera or Radar frames, as they belong to a separate hyperepoch with a different time base.
Hyperepochs and epochs define the timing boundaries for tasks (runnables); clients define the data boundaries. A client is an operating system process that contains software resources (like CUDA streams) and runnables. Clients are specified in the graph specification under the Clients header. Each client specifies its contained software resources (if any), the epochs present in that client, and the runnables associated with each epoch. A typical client is specified as follows:
Version: 3.0.0
Drive:                                 # Graph ID
  Clients:
    - Client0:                         # Client ID
        Resources:                     # Client0's internal resources
          # Resource Definition
        Epochs:                        # Epochs present in this client
          - Perception.Camera:         # Epoch Global ID - <HyperepochID.EpochID>
              Runnables:               # Runnables present in Perception.Camera
                - ReadCamera:          # Runnable ID (unique inside a client)
                    # Runnable specification...
                - RunnableN:
                    # Runnable specification...
          - Perception.Radar:
              Runnables:               # Runnables present in Perception.Radar
                - ProcessRadar:
                    # Runnable specification...
Clients can specify resources that are visible only to runnables local to that client; these resources cannot be accessed by runnables in other clients. (Global resources, by contrast, are visible to all runnables.) Process-specific resources like CUDA streams and PVA streams are examples of client resources that cannot be shared across different clients. Internal scheduling mutexes can also be modeled here. These resources are specified in a format like that of Global Resources.
Clients:
  - Client1:
      Resources:
        ResourceType0:
          - ResourceType0_Instance0
          - ResourceType0_InstanceN
        ResourceTypeN:
          - ResourceTypeN_Instance0
          - ResourceTypeN_InstanceN
The following software resources map to corresponding hardware engines: CUDA_STREAM maps to a GPU instance, and PVA_STREAM maps to a VPU instance.
The hardware engine mapping is conveyed to the compiler when specifying the resource instances, as shown in the example below. The hardware resource instances referenced here must be declared under the corresponding hardware resource type in the Global Resources section. Note that the compiler will throw an error if the limits on the mapped resource (as specified in the Hardware Accelerator Resource Types section) are violated.
Clients:
  - Client0:
      Resources:
        CUDA_STREAM:
          - CUDA_STREAM0: GPU0   # CUDA_STREAM0 mapped to GPU0
          - CUDA_STREAM1: GPU0   # CUDA_STREAM1 mapped to GPU0
        PVA_STREAM:              # A client can have one unique stream per VPU
          - PVA_STREAM0: VPU0    # PVA_STREAM0 mapped to VPU0
Resource types other than those specified in the Global Resources section above are treated as local scheduling mutexes. These cannot be mapped to a hardware resource.
Clients:
  - Client0:
      Resources:
        LOCAL_SCHED_MUTEX:
          - LOCAL_SCHED_MUTEX0
        LOCAL_RESOURCE_MUTEX:
          - RESOURCE_MUTEX0
Here's a visual representation of the data, timing, and resource boundaries in STM:
Tasks in the system that execute on hardware engines are known as runnables. Ideally, each runnable should use only a single hardware engine; synchronous runnables (runnables that use multiple hardware engines simultaneously) are currently not supported. Runnables require resources for execution and can depend on other runnables. Care must be taken to ensure that dependencies do not introduce a cycle in the graph. Depending on the type of engine used, the compiler classifies each runnable into one of the following three classes: normal (CPU) runnables, submitter runnables that submit work to a hardware accelerator from a CPU, and submittee runnables that represent the submitted work executing on the accelerator.
For each runnable, the following parameters can be set (each appears in the example below): WCET, StartTime, Resources, Dependencies, Submits, and Priority.
Runnables are referenced globally (for example, in Dependencies and Submits) using an ID of the form ClientID.RunnableID. Dependencies can only be specified for runnables belonging to the same epoch.

Clients:
  - Client0:
      Resources:
        CUDA_STREAM:
          - CUDA_STREAM0: GPU0
      Epochs:
        - Perception.Camera:                    # Camera epoch in the Perception hyperepoch
            Runnables:
              - ReadCamera:                     # Normal runnable
                  WCET: 10us
                  Resources:
                    - CPU                       # This runnable runs on a CPU
                  Priority: 2
              - PreProcessImage:                # Submitter runnable
                  WCET: 20ms
                  StartTime: 1ms                # Starts 1ms after the camera epoch
                  Resources:                    # GPU submitter needs CPU0 and a stream
                    - CPU0
                    - CUDA_STREAM
                  Dependencies: [Client0.ReadCamera]   # Depends on ReadCamera
                  Submits: Client0.PreProcessGPUWork   # Mentions submittee
                  Priority: 2                   # Has a relative priority of 2
              - PreProcessGPUWork:              # Submittee runnable
                  WCET: 5000ns
                  Resources: [GPU]
                  Dependencies:
                    - Client0.PreProcessImage   # Optional for submittees
        - Perception.Radar:                     # Radar epoch in the Perception hyperepoch
            Runnables:
              - ProcessRadar:
                  # Runnable specification...
STM can round-robin between multiple runnables within an execution slot. This mode of operation is useful when performance constraints prevent executing all the runnables in an epoch, and it also lets users reduce the frequency at which any particular runnable runs. When runnables are round-robinned against each other, STM uses the union of their dependencies and schedules a slot that satisfies all of them. The time allocated to the round-robin slot is the largest WCET specified among the round-robinned runnables. Round-robin groups are specified in the Epochs section of the corresponding hyperepoch using the AliasGroup construct.
For each AliasGroup, the Steps parameter specifies the IDs of the runnables in the desired round-robin sequence.
Round-robinned runnables are subject to the following constraints:
The following example shows a use case where two cameras are round-robinned against each other. In the even frames of the Camera epoch, Client0.PreProcessCamera1 and Client0.ProcessCamera1GPUWork run; in the odd frames, Client0.PreProcessCamera2 and Client0.ProcessCamera2GPUWork run. Their parent and child dependencies are handled automatically: ReadCameras1And2 and PostProcessCameras run in every frame, and all of these runnables wait on the correct dependencies.
Version: 3.0.0
Drive:                                           # Graph ID
  Identifier: 101
  Resources:                                     # Global Resources
    CPU: [CPU0, CPU1, CPU2]
    GPU: [GPU0]
  Hyperepochs:
    - Perception:                                # Hyperepoch ID
        Period: 100ms                            # Hyperepoch period
        Resources: [CPU0, GPU0, Client0.CUDA_STREAM0]  # Resource mapping
        Epochs:
          - Camera:                              # Epoch ID
              AliasGroups:                       # Define Round Robin Groups
                - PreProcessRoundRobinGroup:     # AliasGroup's ID
                    Steps:                       # This group round-robins between
                      - Client0.PreProcessCamera1    # Client0.PreProcessCamera1 and
                      - Client0.PreProcessCamera2    # Client0.PreProcessCamera2 in alternate frames.
                - ProcessGPUWorkRoundRobinGroup: # AliasGroup's ID
                    Steps:                       # This group round-robins between
                      - Client0.ProcessCamera1GPUWork  # Client0.ProcessCamera1GPUWork and
                      - Client0.ProcessCamera2GPUWork  # Client0.ProcessCamera2GPUWork in alternate frames.
              Period: 14ms
              Frames: 2
  Clients:
    - Client0:
        Resources:
          CUDA_STREAM: [CUDA_STREAM0: GPU0]
        Epochs:
          - Perception.Camera:                   # Camera epoch in the Perception hyperepoch
              Runnables:
                - ReadCameras1And2:
                    WCET: 3ms
                    Resources:
                      - CPU                      # This runnable runs on a CPU
                - PreProcessCamera1:             # Submitter runnable
                    WCET: 3ms
                    Resources:                   # GPU submitter needs CPU and a stream
                      - CPU
                      - CUDA_STREAM
                    Dependencies: [Client0.ReadCameras1And2]  # Depends on ReadCameras1And2
                    Submits: Client0.ProcessCamera1GPUWork    # Mentions submittee
                - ProcessCamera1GPUWork:         # Submittee runnable
                    WCET: 4ms
                    Resources: [GPU]
                    Dependencies:
                      - Client0.PreProcessCamera1    # Optional for submittees
                - PreProcessCamera2:             # Submitter runnable
                    WCET: 3ms
                    Resources:                   # GPU submitter needs CPU and a stream
                      - CPU
                      - CUDA_STREAM
                    Dependencies: [Client0.ReadCameras1And2]  # Depends on ReadCameras1And2
                    Submits: Client0.ProcessCamera2GPUWork    # Mentions submittee
                - ProcessCamera2GPUWork:         # Submittee runnable
                    WCET: 4ms
                    Resources: [GPU]
                    Dependencies:
                      - Client0.PreProcessCamera2    # Optional for submittees
                - PostProcessCameras:
                    WCET: 3ms
                    Resources: [CPU]
                    Dependencies:
                      - Client0.ProcessCamera1GPUWork
                      - Client0.ProcessCamera2GPUWork
An illustration of the execution sequence for this example in steady state is shown below:
STM allows users to switch schedules at runtime through two mechanisms. For one of these mechanisms, the hyperepoch composition, in terms of the number of hyperepochs and their mapped hardware resources, must be the same across all the DAGs that participate in that operation. This restriction does not apply to the other mechanism.
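As a hedged sketch of that composition restriction (the graph, hyperepoch, epoch, and identifier names below are hypothetical, and in practice each schedule would be its own specification), the two DAGs shown back to back here would satisfy it: each has a single hyperepoch mapping the same hardware resources, even though the epochs inside differ.

# Schedule A (separate specification)
Version: 3.0.0
DriveDay:                            # Hypothetical graph ID
  Identifier: 201
  Resources:
    CPU: [CPU0, CPU1]
    GPU: [GPU0]
  Hyperepochs:
    - Perception:                    # One hyperepoch mapping CPU0, CPU1, GPU0
        Period: 100ms
        Resources: [CPU0, CPU1, GPU0]
        Epochs:
          - Camera:
              Period: 33ms
              Frames: 3

# Schedule B (separate specification)
Version: 3.0.0
DriveNight:                          # Hypothetical graph ID
  Identifier: 202
  Resources:
    CPU: [CPU0, CPU1]
    GPU: [GPU0]
  Hyperepochs:
    - Perception:                    # Same hyperepoch count and resource mapping
        Period: 100ms                # as Schedule A; epochs and runnables may differ.
        Resources: [CPU0, CPU1, GPU0]
        Epochs:
          - Lidar:
              Period: 50ms
              Frames: 2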