Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

January 2010 - Posts

  • Pervasive Software Among 10 IT Companies to Watch in 2010

    Just days into 2010, Robin Bloor has included Pervasive among his 10 IT Companies to Watch in 2010, for the second year running.  Pervasive lands second on his list, and Bloor believes we will be “worth watching this year.”

    Although Pervasive Software comprises four innovative technologies, Bloor praises Pervasive DataRush as the “so-far-little-recognized jewel” due to its parallel processing capabilities.  The Pervasive DataRush processing engine became generally available in March 2009.  Bloor adds that “there are not many such engines and as we advance further into the world of multicore CPUs, every vendor that has a parallel engine is likely to experience strong demand. Parallel engines improve processing speeds by one or two orders of magnitude, bringing down query responses (for example) from hours to minutes.”  Early adopters of the Pervasive DataRush engine are taking advantage of the impressive speedups and beginning to use DataRush to enable their own applications.

    The Pervasive DataRush team also built and released data-intensive applications last year for data services, data mining, and predictive analytics.  Pervasive DataMatcher helps organizations detect fraud by finding duplicate records.  Pervasive RushRecommender is a scalable dataflow implementation of collaborative filtering, based on weighted co-clustering, that gives organizations insight into their customers' needs.

    Dr. Nena Marín, Pervasive DataRush Chief Scientist, co-presented with The University of Texas at the KDD Cup 2009 in Paris, France.  The co-authored paper, titled “Pervasive Parallelism in Data Mining: Dataflow Solution to Co-clustering Large and Sparse Netflix Data,” was selected from among 686 total submissions.  It detailed work to deliver performance improvements in the Netflix recommender system running a computationally intensive co-clustering algorithm.  The successful duo went on to speak about accurate prediction of customer behavior at Predictive Analytics World 2009, in a talk titled “Churn, Baby, Churn: Fast Scoring on Large Telecom Dataset.”

    Stay on the lookout as the Pervasive DataRush team uncovers new analytic applications in healthcare and cyber security, and new academic alliance advancements with TACC in the new year.  Also on the roadmap are improved speed and content filtering.  We continue to address the gap between proliferating multicore processors and exploding volumes of data to get you the information you need quickly.  Please email us if you'd like to qualify for the Early Adopters Program.

  • Of teraflops and terabytes

    Before the holidays, I attended SC '09, the annual supercomputing conference.  While supercomputing has traditionally been the domain of academia, the needs of business and scientific computing are converging.  The datasets and processing needs of companies are growing to the point that the problems addressed in both worlds are becoming comparable in size.  Hardware vendors are producing solutions which address both the need to store large volumes of data and the need to provide large amounts of computational power.  But to do anything, the data and the computation must - at some point in time - be co-located.  With conventional server computing, this isn't really a problem, since everything is local to the machine.  In the more scalable architectures, however, data can be non-local.  Furthermore, the cost of accessing remote data can be great (and even variable in grid-based architectures).  So how can you bring the terabytes and teraflops together to solve your problems?

    The traditional supercomputing solution is message passing (MPI), moving the data to the computation.  Nodes send messages to and receive messages from each other as needed to exchange data.  It is a very flexible, low-level approach.  But you also want to make sure you spend most of your time processing the data, not communicating with other nodes.  I attended a number of presentations discussing aspects of this, such as optimizing the I/O during the initialization phase, load balancing the workload, assigning work to nodes to minimize data transfer costs, and utilizing asynchronous messaging to reduce stalls in processing.
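
    To make the message-passing idea concrete, here is a minimal sketch in Python.  It does not use MPI itself; threads and queues stand in for nodes and the network, and the names worker and distributed_sum are made up for illustration.  The coordinator sends each "node" a slice of the data, and each node sends back a partial result.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """A simulated node: receive a chunk, compute on it, send the result back."""
    chunk = inbox.get()               # blocking receive (the role MPI_Recv plays)
    outbox.put((rank, sum(chunk)))    # send the partial result (the role of MPI_Send)


def distributed_sum(data, n_nodes):
    """Move slices of the data to the workers, then combine their partial sums."""
    inboxes = [queue.Queue() for _ in range(n_nodes)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(rank, inboxes[rank], results))
        for rank in range(n_nodes)
    ]
    for t in threads:
        t.start()
    # The coordinator ships the data to where the computation runs.
    for rank in range(n_nodes):
        inboxes[rank].put(data[rank::n_nodes])
    for t in threads:
        t.join()
    partials = dict(results.get() for _ in range(n_nodes))
    return sum(partials.values())
```

    Even in this toy version, the programmer decides how the data is split, who talks to whom, and when to wait, which is exactly the flexibility and the burden described above.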

    Of course, the opposite approach is possible too: moving the computation to the data.  In the ideal case, this is what happens in map/reduce models like Hadoop.  The computational and storage architectures are one and the same; data is spread across the nodes, with each map running on local data.  This means that distribution of the data becomes part of the storage - it doesn't completely disappear, but becomes a one-time, up-front cost.
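
    A rough sketch of the same idea in Python, using a word count.  This is not Hadoop's API; the partitions dictionary is a hypothetical stand-in for data already stored on each node, and map_local folds the map step and a local combine into one function for brevity.  Only the small per-node summaries travel to the reduce step.

```python
from collections import Counter

# Hypothetical layout: each "node" already holds its own partition of the data.
partitions = {
    "node0": ["dataflow moves data", "mpi moves data"],
    "node1": ["hadoop moves computation"],
}


def map_local(lines):
    """Runs where the data lives and returns a small per-node summary."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Combines the per-node summaries into one global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(parts):
    # Only the compact Counter objects move; the raw lines stay on their node.
    return reduce_counts(map_local(lines) for lines in parts.values())
```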

    And what about dataflow?  As the name suggests, in the dataflow model the data moves.  In fact, you can view dataflow as a more structured form of MPI.  However, this structure provides one thing for free - pipelining allows computation and communication to overlap without special programmer effort.  This works equally well for disk I/O, where you can get good throughput on sequential reads, and for cross-node communication.
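
    Pipelining can be sketched with a few threads connected by bounded queues.  This is an illustration only, not the DataRush API; run_pipeline, source, and operator are hypothetical names.  Each operator runs in its own thread, so one stage can be working on an item while the previous stage is already fetching the next one.  (In CPython, threads mainly help overlap I/O waits rather than CPU work.)

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def source(items, out):
    """Feeds items into the first queue, then signals end of stream."""
    for item in items:
        out.put(item)
    out.put(_DONE)


def operator(fn, inp, out):
    """Applies fn to each item as it arrives and passes the result downstream."""
    while True:
        item = inp.get()
        if item is _DONE:
            out.put(_DONE)
            return
        out.put(fn(item))


def run_pipeline(items, fns, capacity=4):
    """Chains the functions in fns into stages linked by bounded queues."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=source, args=(items, queues[0]))]
    for fn, inp, out in zip(fns, queues, queues[1:]):
        threads.append(threading.Thread(target=operator, args=(fn, inp, out)))
    for t in threads:
        t.start()
    results = []
    while True:  # drain the last queue so upstream stages never block forever
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

    The bounded queues keep a fast stage from running arbitrarily far ahead of a slow one, and the caller only describes the stages; the overlap comes from the structure rather than from hand-written messaging code.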

    None of these is a universal solution.  Each has strengths and weaknesses.  As stated, MPI provides flexibility, but also places most of the responsibility on the programmer.  On the other hand, while map/reduce and dataflow do much of this work transparently, they both require thinking about problems in a way that fits the paradigm - which may not be the intuitive approach to the solution.

     
