Commonly a data pipeline is processed in a single iteration of the schedule. Depending on the data dependencies and the hardware engines used, this can leave the overall utilization very low - especially when the beginning of the data pipeline uses different hardware engines than the end. Consider the following minimal example - for simplicity the passes are folded into the nodes:
Without pipelining the schedule would look like this:
Pipelining splits a data pipeline into multiple phases. The number of phases is the pipelining depth.
Note: Currently only a maximum pipelining depth of 2 is supported.
The phases of the Nth iteration of the data pipeline are spread over D consecutive iterations of the schedule - iterations N through N+D-1, where D is the pipelining depth.
This allows phases of the data pipeline to be scheduled concurrently within the same scheduling iteration, where the data dependencies would otherwise force them to run sequentially.
Note: for phase-1 nodes the data pipeline iteration counter equals the schedule iteration counter.
In the Nth iteration of the schedule the following phases are run in parallel:
- phase 1, processing data pipeline iteration N
- phase 2, processing data pipeline iteration N-1
Note: on the first schedule iteration the second phase doesn't have input data yet and must therefore be a no-op.
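To make the mapping concrete, here is a minimal sketch - plain Python, not part of the framework, with a made-up function name - that computes which phase processes which data pipeline iteration within a given schedule iteration:

```python
# Illustrative only: map a schedule iteration to the (phase, data pipeline
# iteration) pairs that run in parallel, for a given pipelining depth.

def phases_in_schedule_iteration(n: int, depth: int = 2) -> list[tuple[int, int]]:
    """Return (phase, data_pipeline_iteration) pairs running in parallel
    during schedule iteration n (0-based)."""
    pairs = []
    for phase in range(1, depth + 1):
        pipeline_iteration = n - (phase - 1)
        if pipeline_iteration >= 0:  # later phases have no input yet -> no-op
            pairs.append((phase, pipeline_iteration))
    return pairs

# Schedule iteration 0 runs only phase 1 (phase 2 is a no-op);
# schedule iteration 5 runs phase 1 of iteration 5 and phase 2 of iteration 4.
assert phases_in_schedule_iteration(0) == [(1, 0)]
assert phases_in_schedule_iteration(5) == [(1, 5), (2, 4)]
```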
Splitting the schedule into two phases is implemented without STM having any special knowledge about it. When generating the input for the STM compiler, the DAG within an epoch is split into disconnected subgraphs: a previously connected graph becomes N connected components, where N is the pipelining depth.
This "disconnection" is achieved by breaking data dependencies commonly derived from the connections in the DAG if the connections spans across phases - the source being a node in phase 1 and the destination being a node in phase 2. This allows the phase-2 nodes to be scheduled concurrently with the phase-1 nodes within the same scheduling iteration. (The case where the source being in phase 2 and the destination being in phase 1 implies an indirect connection which doesn't result in a data dependency anyway.)
Additionally, the behavior of a subset of the ports - those involved in these cross-phase connections - is modified.
In the optimal case the schedule iteration duration is halved, which means the rate at which the data pipeline runs is doubled. As a consequence the hardware engine utilization is doubled too. The overall runtime of one data pipeline iteration stays the same though, since it is just split evenly across two phases / schedule iterations.
Realistically, splitting the passes across two phases results in a schedule iteration duration longer than 50% of the original length. This still means that the rate at which the data pipeline runs is increased, but as a consequence the overall runtime of the data pipeline might increase too.
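As a worked example with hypothetical numbers: if one un-pipelined schedule iteration takes 10 ms, ideal pipelining yields two 5 ms phases - the schedule iteration rate doubles while the end-to-end runtime of one data pipeline iteration stays at 10 ms (2 x 5 ms). If the longer phase instead takes 6 ms, the schedule iteration only shrinks to 6 ms - the rate improves by a factor of 10/6 ≈ 1.67, but the end-to-end runtime grows to 12 ms (2 x 6 ms).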
The schema for .app.json files captures the assignment of components in the schedule to phases under stmSchedules -> <hyperepoch> -> <epoch> -> passes.
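As an illustration of that nesting - the placeholder names below are hypothetical, and the exact shape of the passes value is an assumption rather than taken from the schema itself - a pipelined epoch could assign its components to the two phases with one inner array per phase:

```json
{
    "stmSchedules": {
        "myHyperepoch": {
            "myEpoch": {
                "passes": [
                    ["top.sensorNode", "top.detectNode"],
                    ["top.planNode", "top.actNode"]
                ]
            }
        }
    }
}
```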