Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

March 2008 - Posts

  • Come visit us at JavaOne

    JavaOne is our most important show of the year, and this year we will be showing our DataRush framework in our booth. We would love to meet you there, hear about your long-running analytic applications, and discuss the opportunity to try out the product. We would be happy to provide you with a free pass to the Pavilion, which is the exhibit area with over a hundred companies signed up so far. Just send us your info by entering a comment below or emailing me at shochschild@pervasive.com, and we will see you there!
  • We made the cover of JDJ!

    Our fearless leader Jim Falgout has written an article for the Java Developer's Journal on the process he and his team went through while developing a fuzzy matching application. We're proud to note that "One Team, One Month, One JVM" is the cover article for the March 25th edition.
  • What to do with all those cores?

    Larry Dignan asks in his blog what we will do with the 6-core systems Intel is planning to provide by the fourth quarter. Personally, I can't wait to get my hands on a system with 6-core processors. With DataRush, more cores equals better performance. For DataRush customers, better performance of their data processing equals more business. DataRush won't help Microsoft Word open the big documents on your desktop any faster, sorry about that. But it does provide a drastic speedup of data processing, taking full advantage of as many cores as Intel and AMD can throw our way.

  • 400X better performance, no way! Really?

    Yes! A little more background on the problem. Part of the DataRush team quickly developed a matching application for a customer. More on that in a later blog entry. The output of the application is record pairs that are considered to be matches. And that's where we stopped as the customer decided to implement their own "roll-up". Not being sure what they meant by roll-up but glad it wasn't on our schedule, we marched on.

    That is, until the overall process ran and the "roll-up" step took over 3 hours, while the matching step ran in minutes. OK, now we care about roll-ups, so what are they? A roll-up basically takes the set of record pairs and rolls them up into larger sets. For instance, record A and record B are matched. Record B and record C are matched. So we want records A, B and C to be in the same set. Only one record in a set wins; the others are thrown out as duplicates. That's easy with 3 records. Not so easy with millions.

    Right, it doesn't sound too hard. It looks like a disjoint-set problem. One of the guys on the team remembered an algorithm from his undergrad days, did a little Googling and then coded the algorithm in Java. He quickly wrapped that code in a DataRush process. The next step was building an application that read the input data (the output of the matching app), ran it through the disjoint-set algorithm and then wrote the output. DataRush already has highly parallelized readers and writers, so that part was easy.
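
    For readers who have not met disjoint sets before, here is a minimal sketch of the idea in plain Java. It is not the code the team wrote and it uses no DataRush API; the class name RollUp, the string record ids and the union-by-rank with path compression variant are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical union-find over string record ids; a sketch, not a DataRush operator. */
final class RollUp {
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Integer> rank = new HashMap<>();

    /** Returns the representative of id's set, registering id the first time it is seen. */
    String find(String id) {
        parent.putIfAbsent(id, id);
        rank.putIfAbsent(id, 0);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every node on the walked path directly at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Records that a and b matched, merging their sets (union by rank). */
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (ra.equals(rb)) {
            return; // already in the same set
        }
        int rankA = rank.get(ra);
        int rankB = rank.get(rb);
        if (rankA < rankB) {
            parent.put(ra, rb);
        } else {
            parent.put(rb, ra);
            if (rankA == rankB) {
                rank.put(ra, rankA + 1);
            }
        }
    }

    /** Groups every record seen so far under its set representative. */
    Map<String, List<String>> groups() {
        Map<String, List<String>> out = new TreeMap<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            out.computeIfAbsent(find(id), k -> new ArrayList<>()).add(id);
        }
        return out;
    }

    public static void main(String[] args) {
        RollUp rollUp = new RollUp();
        rollUp.union("A", "B"); // matched pair (A, B)
        rollUp.union("B", "C"); // matched pair (B, C)
        rollUp.union("D", "E"); // an unrelated pair
        System.out.println(rollUp.groups());
    }
}
```

    Feeding each matched pair to union and then reading groups once at the end puts A, B and C in one set and D and E in another. Each pair costs close to constant time, which is why a single pass over the pairs can stay fast even with millions of records.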

    Next step: punch the run button and 27 seconds later, we are done. With the same "roll-up" output that was generated by 3+ hours of running SQL statements.

    Many data jobs are not well suited to RDBMS processing and run much faster outside of the database. Looks like we found another one!

  • Domain specific languages

    Kind of a long blog post, but interesting as always from Dr. Mattson. Check out point #3 towards the end of the entry. Dr. Mattson states that "Domain specific languages may be our best hope." I think he is onto something there. Look at Java 7 and the new concurrent library features that are being added: fork-join and parallel arrays. These are both very useful constructs for helping to parallelize sections of your application code. But, being general in use, they are meant for in-memory-only data structures. Very useful, but limited for large data applications where billions of events must be processed.
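
    To make the in-memory point concrete, here is a small sketch of the fork-join style using the java.util.concurrent classes as they later shipped in the JDK. The class name ForkJoinSum and the threshold value are assumptions for illustration. Note that the whole array has to be in memory before the task can start.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Hypothetical example: sums an in-memory array by recursive splitting. */
final class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // assumed cut-off for sequential work
    private final long[] data;
    private final int from;
    private final int to;

    ForkJoinSum(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: sum this slice sequentially.
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, from, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, to);
        left.fork();                      // run the left half asynchronously
        long rightSum = right.compute();  // work on the right half in this thread
        return left.join() + rightSum;
    }

    /** Sums the whole array on the common fork-join pool. */
    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data));
    }
}
```

    The splitting works because every slice of the array is addressable up front. A data set that does not fit in memory needs a different model.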

    DataRush is built with data scalability in mind. By that I mean you can run a DataRush application with thousands, millions or billions of rows with little to no adjustment and no re-coding. DataRush is domain specific; the domain being very large data analytics. Have billions of rows of data to process and analyze: DataRush can help!

    DataRush does not implement a new language, but is built on the existing concepts of dataflow architecture. Our goal is to provide a Java-based set of libraries that allow programmers to build highly performant, highly scalable data analytic applications easily. All this using a language that you are already familiar with and that already enjoys great IDE support.
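
    As a rough illustration of what "dataflow" means here, the sketch below wires three stages together with bounded queues from java.util.concurrent. It does not use or imitate the DataRush API; the class name DataflowSketch, the stage names and the END sentinel are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical three-stage pipeline: source -> square -> sink. */
final class DataflowSketch {
    // Sentinel marking end of stream; assumes real rows never carry this value.
    private static final Integer END = Integer.MIN_VALUE;

    static List<Integer> run(int n) {
        // Bounded queues: a fast producer blocks instead of buffering the whole data set.
        BlockingQueue<Integer> raw = new ArrayBlockingQueue<>(16);
        BlockingQueue<Integer> squared = new ArrayBlockingQueue<>(16);

        Thread source = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    raw.put(i);
                }
                raw.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread square = new Thread(() -> {
            try {
                for (Integer x = raw.take(); !x.equals(END); x = raw.take()) {
                    squared.put(x * x);
                }
                squared.put(END); // pass end-of-stream downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start();
        square.start();

        // The calling thread acts as the sink stage.
        List<Integer> out = new ArrayList<>();
        try {
            for (Integer x = squared.take(); !x.equals(END); x = squared.take()) {
                out.add(x);
            }
            source.join();
            square.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the pipeline", e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(4)); // prints [1, 4, 9, 16]
    }
}
```

    Each stage only ever sees one row at a time, so the same wiring works for four rows or four billion. That is the property the fork-join style above lacks.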

  • It ain't easy ...

    This blog post refers to a talk recently given at the IEEE Micro-39 conference. The talk, by Wen-mei Hwu, is titled "Using Sequential Programming Models to Program Manycore Systems". Mr. Hwu's talk centers on the fact that parallel programming is not easy. We agree totally, Mr. Hwu; that's why we built DataRush!
  • Pervasive and HP issue a press release highlighting linear scalability

    We are tremendously excited to have issued a press release this morning in cooperation with HP. They share our commitment to helping developers address the two-headed challenge of exploding volumes of data and shrinking time windows. Given the remarkable results of our initial deployments, this is just the first of many more such releases.
  • Parks Scheduling - the original paper

    We are very fortunate to have a fairly long and interesting history of academic and commercial research on dataflow, flow-based programming, and other relevant topics.

    Paul Morrison has his site, and I have attached Dr Thomas Parks' seminal paper on scheduling, which our team has implemented in Pervasive DataRush.

  • Linear Scaling

    We continue to do scalability testing on any and every platform we can get access to, particularly as bigger and bigger boxes become known to us. We have had an interesting time running DataRush on different processor architectures, different operating systems, different JVMs, never mind that each box has a different amount of memory, different clock rate, and different disk system. See a listing of these variations here.

    Results so far are exciting, and in general, this whole testing effort has been far more important than we first thought it would be. It has been remarkable to us to see how excited people are about the results, and how quickly people understand our framework’s purpose once we show them the graphs. Further, we continue to learn more each time we run them. For example, we were happy to get HP's HP-UX JVM team involved and gained some important insight there, whereas on another vendor’s system, we broke their JVM. They had never seen an app that wanted all the cores, all at once, all at 100%. They are using DataRush as part of their test suite now.

    At this point, our tests are very simple, embarrassingly parallel, and focused on the single attribute of scalability: how much additional performance you get when you add resources. The Holy Grail is linear scalability, which is defined as doubling performance with a doubling of resources; in other words, demonstrating that adding cores adds only minimal overhead. We are pleased and proud to say that we are achieving great results: near-linear scaling on every platform. This hasn't been easy, and in some cases we have had to spend a lot of time with monitoring and profiling tools, but the smart people we are working with are starting to recognize the real achievement that DataRush represents.

    The magic is we don't have to touch a line of code in the application to do it. Because a developer using the DataRush platform is no longer responsible for manually designing and coding for a specific number of cores, a DataRush-based application can easily be deployed across machines of different capabilities. This is especially important to ISVs and commercial software partners who want robust data-intensive analytic applications to scale and deliver faster performance as the number of cores multiplies.
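
    The definition of linear scalability above boils down to two ratios: speedup (baseline time divided by parallel time) and efficiency (speedup divided by the number of cores). A minimal sketch of that arithmetic follows. The class name ScalingMath and the timings in main are made up for illustration and are not DataRush measurements.

```java
/** Hypothetical helper for reading scalability test results. */
final class ScalingMath {
    /** Speedup relative to the single-core baseline run. */
    static double speedup(double baselineSeconds, double parallelSeconds) {
        return baselineSeconds / parallelSeconds;
    }

    /** Efficiency of 1.0 means perfectly linear scaling on the given core count. */
    static double efficiency(double baselineSeconds, double parallelSeconds, int cores) {
        return speedup(baselineSeconds, parallelSeconds) / cores;
    }

    public static void main(String[] args) {
        // Invented timings: 120 s on one core, 40 s on four cores.
        double baseline = 120.0;
        double onFourCores = 40.0;
        System.out.println("speedup    = " + speedup(baseline, onFourCores));       // 3.0
        System.out.println("efficiency = " + efficiency(baseline, onFourCores, 4)); // 0.75
    }
}
```

    In these terms, "near-linear" means the efficiency stays close to 1.0 as the core count grows.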
  • Appropriate Tools for Appropriate Tasks

    Pervasive DataRush is at a point where we are beginning to implement and deploy real customer applications through our Lighthouse Customer Program. This program provides a free license and significantly discounted professional services for the right opportunities. As we move to address more of these, we have identified some common attributes.

    Notes and Observations -- Pervasive DataRush is appropriate:

      • Where expanding data with shrinking time windows is generating a desire for improved performance
      • Where the task involves whole file operations; non-transactional
      • Where there are batch jobs that take all night and they want it in 3 hours, or 3 hours and they want it in 10 minutes, or 10 minutes and they want it in 30 seconds etc.
      • Where commercial-off-the-shelf-packages (including the Pervasive Data Integration products) are not suitable – a custom application is required
      • Where relational databases have too much overhead and are too slow
      • Where the organization has an early adopter attitude
      • Where there is a commitment to performance; even at the perceived risk of using a non-traditional approach
      • Where there is a willingness to publicize the results

    Does this list fit your situation? You might be just the organization we are looking for to prove once again the incredible performance of our approach. Give us a call -- all you have to lose is all the waiting for your results.
  • Dealing with the data tsunami

    The approach is well known and there is no debate: parallelism is the answer. This answer is a result of hard physics, heat and power. We don’t need to think about this, but we do need to implement the systems. Systems consist of hardware and software. The hardware is there, the software isn’t. Note that the software is there for transactional work, so half of the problem is solved. We are glad that it is, because that half is creating:

      • a huge business in multicore machines
      • a huge effort to collect, store, and distribute the data

    So with the hardware platform at hand (go to Fry’s and slap down a credit card) and the data available, the missing piece required to realize the hidden value is the software architecture. Why do we think this? Some might disagree about one or more of these points, but they are our view. And we are not alone: the big-time analysts (Bloor, IDC, etc.) agree with these statements.

    Assertion #1: Developers aren’t ready. This is a known truth in the general developer community. Although classes, seminars, books, and articles are all trying to help, it is a really hard problem and everyone knows it. If a programmer thinks it is easy, then they don’t know what they are doing or haven’t yet even tried.

    Assertion #2: Languages aren't ready. We see lots of interest in ‘new’ alternatives such as Erlang, Haskell, Threading Building Blocks, etc., because programmers know that the low level functions are just that, too low level.

    Assertion #3: This isn't going to solve itself; there is no silver bullet coming. No free lunch, Intel isn’t going to ride in and save the day, and Amazon computing in the sky isn’t going to make the problems go away.

    We are certain that DataRush can help with this challenge. How are you dealing with it today?
© 2008 Pervasive Software Inc. All Rights Reserved.