Using the dataflow model of DataRush makes writing high-throughput applications easier, but a dataflow graph is still only going to be as fast as its slowest operator. The standard solution is to replace the slow operation with a subgraph, which does the processing in parallel, increasing the effective throughput of the operator. For stateless operators (those whose current output depends only on the current input), this is simple to do: round robin partition the input flow, duplicate the operator for each partition, then round robin unpartition the outputs. Et voilà! Problem solved.
Unfortunately, not everything is so embarrassingly parallel. There are operators which are stateful - for example, one computing a running sum or a windowed average. What then? Has the graph met its match in Amdahl's law? Not necessarily. To understand why, first look at what happens when a stateless operator is horizontally partitioned. Each of the operators receive only 1/Nth of the data, so each runs in only 1/Nth of the time. All are in parallel, so the overall throughput is increased by a factor of N.

The above works because each operator is independent of all the others. The problem with parallelizing stateful operations in the same fashion as above stems from a lack of independence. Each operator needs information from the other parallel operators. If they pass this information among themselves, this forces them to run serially (and violates the dataflow model by introducing other channels of communication). Is it possible to break this dependence? Yes. The root of the solution lies in the observation that the state is a function of the complete input dataflow. By partitioning the input, each operator becomes dependent on the others because they have information not available locally. So the answer is to not partition, but instead to give the complete input to each operator while still leaving each responsible for producing only part of the output. Each can compute the state information required, since each has the full input. On the output side, these can be reassembled to produce the full expected output. Not quite as simple as partitioning, as the operator needs to be modified and an assembler might need to be written, but better than redesigning the whole graph!

So is this better than the single, unparallelized operation? If computing state requires computing the full output, then no. But in many cases, computing the state is a small portion of the work. Overall, the modified operator does less work and therefore finishes faster than the original operator. Since they are all in parallel, this means the whole input is processed in less time, even though we're duplicating effort (computing the state) in each operator!

As the above diagram illustrates, the real savings depend on the difference T - T' and cost of A, the assembly of the final result. It is possible that reassembly is, in itself, an expensive operation, negating any advantage gained. On the other hand, it might also be as simple as a round robin unpartition. So in some situations, it is possible to reduce overall runtime.
The lesson here is that inefficiency in the parallelization does not mean it's not worth parallelizing. It is easy to overlook this because the result is not the ideal division of work, but the goal is not to produce the perfect solution, only a faster one. This is just another example of the classic trade-off of one resource for another: in this case, CPU cycles for time. And in the world of multi-core processors, CPU cycles are plentiful, but there are still only 24 hours in a day.
So much for theory. How well does this work in practice? I'll visit that next time.