Commonly a data pipeline is processed in a single iteration of the schedule. Depending on the data dependencies and the hardware engines used, this can leave the overall utilization very low - especially when the beginning of the data pipeline uses different hardware engines than the end. Consider the following minimal example - for simplicity the passes are folded into the nodes:
Without pipelining the schedule would look like this:
Pipelining splits a data pipeline into multiple phases. The number of phases is the pipelining depth.
Note: Currently only a maximum pipelining depth of 2 is supported.
The phases of the Nth iteration of the data pipeline are spread over D consecutive iterations of the schedule - iterations N through N+D-1, where D is the pipelining depth.
This allows phases of the data pipeline to be scheduled concurrently within the same scheduling iteration, where the data dependencies would otherwise force them to run sequentially.
Note: for phase-1 nodes the data pipeline iteration counter equals the schedule iteration counter.
In the Nth iteration of the schedule the following phases are run in parallel:
- phase 1, processing data pipeline iteration N
- phase 2, processing data pipeline iteration N-1
Note: on the first schedule iteration the second phase doesn't have input data yet and must therefore be a no-op.
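To make the mapping concrete, here is a minimal sketch - plain Python, not part of the framework, with a made-up function name - that computes which phase processes which data pipeline iteration within a given schedule iteration:

```python
# Illustrative only: map a schedule iteration to the (phase, data pipeline
# iteration) pairs that run in parallel, for a given pipelining depth.

def phases_in_schedule_iteration(n: int, depth: int = 2) -> list[tuple[int, int]]:
    """Return (phase, data_pipeline_iteration) pairs running in parallel
    during schedule iteration n (0-based)."""
    pairs = []
    for phase in range(1, depth + 1):
        pipeline_iteration = n - (phase - 1)
        if pipeline_iteration >= 0:  # later phases have no input yet -> no-op
            pairs.append((phase, pipeline_iteration))
    return pairs

# Schedule iteration 0 runs only phase 1 (phase 2 is a no-op);
# schedule iteration 5 runs phase 1 of iteration 5 and phase 2 of iteration 4.
assert phases_in_schedule_iteration(0) == [(1, 0)]
assert phases_in_schedule_iteration(5) == [(1, 5), (2, 4)]
```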
Splitting the schedule into two phases is implemented without STM having any special knowledge about it. When generating the input for the STM compiler, the DAG within an epoch is split into disconnected subgraphs: a previously connected graph becomes N connected components, where N is the pipelining depth.
This "disconnection" is achieved by breaking data dependencies commonly derived from the connections in the DAG if the connections spans across phases - the source being a node in phase 1 and the destination being a node in phase 2. This allows the phase-2 nodes to be scheduled concurrently with the phase-1 nodes within the same scheduling iteration. (The case where the source being in phase 2 and the destination being in phase 1 implies an indirect connection which doesn't result in a data dependency anyway.)
Additionally, the behavior of a subset of the ports - those involved in these cross-phase connections - is modified.
In the optimal case the schedule iteration duration is halved, which means the rate at which the data pipeline runs is doubled. As a consequence the hardware engine utilization is doubled too. The overall runtime of one data pipeline iteration stays the same though, since it is just split evenly across two phases / schedule iterations.
Realistically, splitting the passes across two phases results in a schedule iteration duration longer than 50% of the original length. This still means that the rate at which the data pipeline runs is increased, but as a consequence the overall runtime of the data pipeline might increase too.
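As a worked example with hypothetical numbers: if one un-pipelined schedule iteration takes 10 ms, ideal pipelining yields two 5 ms phases - the schedule iteration rate doubles while the end-to-end runtime of one data pipeline iteration stays at 10 ms (2 x 5 ms). If the longer phase instead takes 6 ms, the schedule iteration only shrinks to 6 ms - the rate improves by a factor of 10/6 ≈ 1.67, but the end-to-end runtime grows to 12 ms (2 x 6 ms).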
The schema for .app.json files captures the assignment of components in the schedule to phases under stmSchedules -> <hyperepoch> -> <epoch> -> passes.
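As an illustration of that nesting - the placeholder names below are hypothetical, and the exact shape of the passes value is an assumption rather than taken from the schema itself - a pipelined epoch could assign its components to the two phases with one inner array per phase:

```json
{
    "stmSchedules": {
        "myHyperepoch": {
            "myEpoch": {
                "passes": [
                    ["top.sensorNode", "top.detectNode"],
                    ["top.planNode", "top.actNode"]
                ]
            }
        }
    }
}
```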