Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

October 2009 - Posts

  • Quicker Dataflows by Duplicating Effort

    Using the dataflow model of DataRush makes writing high-throughput applications easier, but a dataflow graph is still only going to be as fast as its slowest operator.  The standard solution is to replace the slow operation with a subgraph, which does the processing in parallel, increasing the effective throughput of the operator. For stateless operators (those whose current output depends only on the current input), this is simple to do: round robin partition the input flow, duplicate the operator for each partition, then round robin unpartition the outputs.  Et voilà! Problem solved. 

    Unfortunately, not everything is so embarrassingly parallel.  There are operators which are stateful - for example, one computing a running sum or a windowed average.  What then?  Has the graph met its match in Amdahl's law?  Not necessarily.  To understand why, first look at what happens when a stateless operator is horizontally partitioned.  Each of the operators receive only 1/Nth of the data, so each runs in only 1/Nth of the time.  All are in parallel, so the overall throughput is increased by a factor of N.

    The above works because each operator is independent of all the others.  The problem with parallelizing stateful operations in the same fashion as above stems from a lack of independence.  Each operator needs information from the other parallel operators.  If they pass this information among themselves, this forces them to run serially (and violates the dataflow model by introducing other channels of communication).  Is it possible to break this dependence?  Yes.  The root of the solution lies in the observation that the state is a function of the complete input dataflow.  By partitioning the input, each operator becomes dependent on the others because they have information not available locally.  So the answer is to not partition, but instead to give the complete input to each operator while still leaving each responsible for producing only part of the output.  Each can compute the state information required, since each has the full input.  On the output side, these can be reassembled to produce the full expected output.  Not quite as simple as partitioning, as the operator needs to be modified and an assembler might need to be written, but better than redesigning the whole graph!

    So is this better than the single, unparallelized operation?  If computing state requires computing the full output, then no.  But in many cases, computing the state is a small portion of the work.  Overall, the modified operator does less work and therefore finishes faster than the original operator.  Since they are all in parallel, this means the whole input is processed in less time, even though we're duplicating effort (computing the state) in each operator! 

    As the above diagram illustrates, the real savings depend on the difference T - T' and cost of A, the assembly of the final result. It is possible that reassembly is, in itself, an expensive operation, negating any advantage gained.  On the other hand, it might also be as simple as a round robin unpartition.  So in some situations, it is possible to reduce overall runtime.

    The lesson here is that inefficiency in the parallelization does not mean it's not worth parallelizing.  It is easy to overlook this because the result is not the ideal division of work, but the goal is not to produce the perfect solution, only a faster one.  This is just another example of the classic trade-off of one resource for another: in this case, CPU cycles for time.  And in the world of multi-core processors, CPU cycles are plentiful, but there are still only 24 hours in a day.

    So much for theory.  How well does this work in practice?  I'll visit that next time.

  • Predictive Analytics World Conference Presentation on DataRush and Predicting Customer Behavior

     

    Watch Video Lecture on Pervasive Parallelism in Data Mining, click here

     UT Austin presents at PAW on using DataRush parallelism for fast and accurate prediction of customer behavior:

    4:55pm-5:40pm
    Track 2: Telecommunications
    Room: Walnut A&B
    KDDcup 2009 Competition Results: Orange Labs (France Telecom)
    Churn, Baby, Churn: Fast Scoring on Large Telecom Dataset

    Churn prediction and management is critical for companies in the fast and competitive telecommunication market. In this session, we introduce the dataflow computational model in the context of data and computationally intensive high performance parallel data mining. We present a highly scalable and robust model capable of scoring "propensity-to-churn" at the rate of 50,000 customers in a 1.6GB test set (Orange Labs France Telecom, KDD Cup) in 3 minutes on commodity 16-core CPUs. This is an effective scoring runtime of 3.6 milliseconds per customer, orders of magnitude faster than some systems. As a competitor in this year's KDDcup data mining competition, this speed enables more iterations towards improved performance; while the research focus was speed, resulting predictive accuracy ranked higher than 70% of competitors.

    Moderator: Eric A. King, The Modeling Agency

    Speaker: Srivatsava Daruru, Research Assistant, The University of Texas at Austin


More Posts