Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

October 2008 - Posts

  • Java at SuperComputing '08

    Multicore chips offer the potential to get more done not through faster processing, but by offering more processing on a single chip. It's only a "potential" to get more done, though, because applications must be written to do more than one thing at a time to see an actual performance boost. Parallel application development is a serious challenge in commodity application development because few of the millions of developers out there know anything about doing it, and even fewer know how to do it well.

    This is not news in the IT world, where the industry's brightest stars are funding research labs, endowing professorships, and backing universities, not only to prepare the next generation of programmers for parallel application development, but also to go back to the drawing board and rethink the programming models, tools, and algorithms we'll need to successfully exploit the promise of multicore processors in everyday computing.

    The field is in flux, and the practices, programming models, and hardware of tomorrow have the potential to be radically different from today. But businesses have a job to get done today, and waiting around for things to settle means not making payroll tomorrow. Given this state of flux, how can you continue to get things done while preparing for the future?

    Managed execution environments, such as Java, offer a lot of flexibility for development today while protecting -- or, "future proofing" -- your application for tomorrow.

    While Java is not the only such environment, it is popular and widely available. More importantly for programmer productivity, Java applications can be written at a high level, and the major chip manufacturers are making significant investments in improving the performance of Java on their hardware. The pace of development is increasing, with new libraries and constructs in Java offering higher levels of abstraction for expressing parallel work. These abstractions increase the productivity of application programmers and offer them the promise of future performance gains: the Java community and the major hardware and software companies can continue to improve the performance of Java underneath the abstractions, and the applications that use them will simply get faster.
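
    As a rough illustration of what "a higher level of abstraction" can look like, here is a minimal sketch that sums an array on several threads using the standard java.util.concurrent library. It is not the DataRush API; the class and method names are hypothetical, and the lambda syntax assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: standard-library concurrency, not the DataRush API.
// The class and method names are hypothetical.
class ParallelSumSketch {

    // Sums the array by handing one contiguous chunk to each worker task.
    static long parallelSum(final int[] values, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (values.length + workers - 1) / workers; // ceiling division
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int start = 0; start < values.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, values.length);
                tasks.add(() -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) {
                        partial += values[i];
                    }
                    return partial;
                });
            }
            long total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 10;
        }
        int workers = Runtime.getRuntime().availableProcessors();
        System.out.println(parallelSum(data, workers));
    }
}
```

    The application code only describes the chunks of work; how those chunks are scheduled onto cores is left to the library and the JVM, which is where future performance improvements can land without changes to the application.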

    The first efforts to use Java in HPC and parallel computing a decade ago were disappointing, largely because of poor performance. But a decade of work on JVM performance has paid off, and developers can now expect performance on a par with more traditional MPI-based approaches (http://hal.inria.fr/docs/00/31/20/39/PDF/RT-0353.pdf).

    All of this is relevant for us as we move toward the release of our DataRush engine library.

    We are gearing up for our next show, SuperComputing '08, which is coming to Austin. We have the biggest booth we have ever had at a show, and we are excited about showing off some of the work we've done on this topic. We are in booth 203, the first booth on the right side of the first aisle, and we hope you will come on by!

    We also have a limited number of exhibit passes -- if you would like one, just email me at shochschild@pervasive.com

  • Petabytes of data spilling on the floor.

    A million seconds is about 12 days.
    A billion seconds is more than 31 years.
    A trillion seconds is about 31,688 years.
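
    For anyone who wants to check the arithmetic, here is a small sketch of the conversion. It assumes 86,400 seconds per day and a 365.25-day year (an assumption; the post does not say which year length it used), which gives roughly 11.6 days, 31.7 years, and 31,688 years. The class and method names are made up for illustration.

```java
// Illustrative arithmetic check; the class and method names are hypothetical.
class SecondsScale {
    static final double SECONDS_PER_DAY = 86_400.0;
    static final double DAYS_PER_YEAR = 365.25; // assumption: average year length

    static double toDays(double seconds) {
        return seconds / SECONDS_PER_DAY;
    }

    static double toYears(double seconds) {
        return toDays(seconds) / DAYS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.println("A million seconds in days:   " + toDays(1e6));
        System.out.println("A billion seconds in years:  " + toYears(1e9));
        System.out.println("A trillion seconds in years: " + toYears(1e12));
    }
}
```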

    Last year IDC released the results of a study that found the world generated 161 exabytes of digital data the year before. How much data is that? A lot. It's 161,000 petabytes. It's 161 million terabytes. It's 161 billion gigabytes. All still really big numbers.

    For a more useful perspective, consider the US Library of Congress:
    With millions of books on its shelves, it is the largest library on Earth.
    Yet the Library's printed works collection is estimated to store the equivalent of only about 10 terabytes. To even get into the petabytes of data we have to include the contents of all US research libraries, and then the total is only 2 petabytes.

    While the sheer size of the numbers is entertaining, there are two serious points here. First, in 2006 we created more digital data than we could store. The same IDC study estimates that the world had only 181 exabytes of storage available in 2006. In that storage budget we had to store all the previously stored data, in addition to some fraction of the newly generated data. 2006 was the first time we exceeded our storage budget in the entire known history of humankind.

    Second, this storage gap is expected to grow rapidly: IDC estimates we'll have a total of 601 exabytes of storage available worldwide by 2010, but in that year alone we will create 988 exabytes of new data.

    At the risk of sounding abstract, our society is in the midst of a profound change. There are important debates going on about what this means, and about how much of what we create we'll be able to leave to future generations. These questions will take a long time to sort out.

    In the meantime, businesses and other organizations that thrive on data face new challenges they need to respond to today. Until now they've had the luxury of assembling data over time, analyzing it offline, conducting long-term studies, revisiting old data, and so on. The trick was getting all of the data into one place; once that happened, it could be analyzed and re-analyzed as many times as one had a reason to do so.

    This is no longer the case. As streams of interactive data -- data about everything from online browsing behavior to health statistics to the financial markets -- multiply exponentially, data-centric organizations must change the way they think about data. Data streams now need to be processed during collection, either to support making tactical decisions in near real-time based on the stream, or to reduce the data to its vital characteristics for in-depth analysis offline.
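
    To make "reducing the data to its vital characteristics" concrete, here is a minimal sketch of a single-pass summary that keeps only a count, mean, variance, minimum, and maximum, using Welford's online algorithm for the mean and variance. It is a generic technique, not how DataRush works internally, and the class name is hypothetical.

```java
// Illustrative only: a constant-memory summary of a numeric stream.
// Uses Welford's online algorithm; not part of any DataRush API.
class RunningStats {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    // Folds one observation into the summary; the raw value is not retained.
    void accept(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    long count() { return count; }
    double mean() { return mean; }
    double min() { return min; }
    double max() { return max; }

    // Sample variance; defined as 0 when fewer than two values have been seen.
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.accept(v);
        }
        System.out.println("count=" + stats.count() + " mean=" + stats.mean()
                + " variance=" + stats.variance()
                + " min=" + stats.min() + " max=" + stats.max());
    }
}
```

    Each value is seen once and then discarded, so memory use stays constant no matter how long the stream runs; the summary can feed a near-real-time decision or be stored for later offline analysis.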

    Some view this change in information management as a catalyst for the creation of a new branch of computer science and engineering. Stream processing (as it is called) has sparked a great deal of interest and research among academics and businesses alike, and the architectures they come up with will determine how society continues to evolve while it soaks in this data soup.

    Pervasive's DataRush is right on the front line of this change. While not (currently) useful for the ultra-low latency required by algorithmic trading systems, DataRush offers very high levels of throughput, allowing for far greater use of actual data, rather than sampled or derived data.

    As co-sponsors of the recent Gartner Event Processing Summit, we spoke with a number of the top software providers in the event processing marketplace, and they all recognize the need for higher throughput in addition to the low latency they already compete on.

    What new insights and information could you get from the ever-growing wave of data you deal with?
