Pervasive Software CTO Mike Hoskins on the benefits of building parallelized multicore-powered Data-Intensive HPC
Can you describe your Data-Intensive HPC initiatives?
We have always sought out opportunities to work with researchers in pushing the boundaries of Computer Science. In this specific case, we have committed engineering time and financial resources towards research in parallelizing an interesting new approach to machine learning algorithms under the guidance of Joydeep Ghosh, Ph.D., Schlumberger Centennial Chair Professor of Electrical and Computer Engineering at The University of Texas at Austin. While that is the focus here at SC08, we are working with researchers and seeking synergies across a number of institutions and multidisciplinary programs.
What was the area of interest?
As the volume of data available to examine increases exponentially, it becomes more important and useful to perform analysis over an entire data set, spanning its full history, rather than relying on sampling or point-in-time snapshot analysis.
Given the growth in input data and the increasing capacity of storage systems, the challenge for many organizations, both commercial and academic, becomes one of best utilizing the new multicore capabilities being delivered. This project demonstrates how parallelization can reduce the run time of an analysis, thereby allowing a larger number of data elements to be examined.
How is this relevant?
There are four BIG ideas that demonstrate how Pervasive DataRush can really supplement traditional HPC as data volumes grow:
Scaling for data. How do you rapidly scale for terabyte-sized data sources? Going forward many researchers and analysts will have to cope with huge data volumes and short time windows simultaneously. With its foundation on dataflow principles, Pervasive DataRush can elegantly tackle the problem of data-intensive HPC (DI-HPC).
Scaling for multicore. Both the CTO of Intel and the Chief Scientist of Microsoft have described the multicore revolution as the biggest paradigm-changer to hit computing in 30 years. Suddenly, massively powerful SMP boxes, which are essentially “personal supercomputers,” are within our grasp, with this form and design moving rapidly towards commodity hardware cost and ubiquity.
The constraint is software. Parallel programming is extremely challenging, and even more complex when contemplated at the fine-grained levels required by multicore systems/nodes. Our thesis is that DI-HPC needs a new generation of software technology. Pervasive DataRush was designed and built for this multicore world of fine-grained multi-threaded parallelism, and seamlessly powers applications that “auto-scale” as you add more cores. Using Pervasive DataRush, a DI-HPC world can emerge that can easily and fully exploit the modern world of multicore platforms.
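To make the "auto-scale" idea concrete, here is a minimal sketch in plain Java, not the DataRush API: the class name and partitioning scheme are hypothetical, but the core move, sizing the worker pool to the number of cores the runtime reports, is how an engine can scale the same application code across different multicore machines without changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: a data-parallel sum whose degree of
 *  parallelism follows the core count of the host machine. */
public class AutoScaleSum {
    public static long parallelSum(long[] data) throws Exception {
        // The engine sizes itself to the hardware it lands on.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        int chunk = (data.length + cores - 1) / cores;
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < data.length; i += chunk) {
            final int lo = i;
            final int hi = Math.min(i + chunk, data.length);
            // Each partition is summed independently on its own core.
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int j = lo; j < hi; j++) s += data[j];
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        long[] data = new long[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(parallelSum(data)); // sum of 0..99999
    }
}
```

On a two-core box the same code runs two partitions; on a sixteen-core SMP box it runs sixteen, with no change to the application.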
Scaling for “cognitive load” (aka: programmer productivity). The “performance” challenge in HPC is not just runtime performance, but “design-time” performance. Current DI-HPC programming paradigms are shockingly unproductive when it comes to how quickly a successful design can be built and tuned. The net result: programmer productivity in the DI-HPC space is disappointing. The advent of commodity multicore SMPs and hardware hyper-parallelism will only exacerbate the problem, as the gap in advanced parallel programming skills widens into a chasm.
Pervasive DataRush offers a radical new vision: an SDK and massively-parallel engine that hides much of the complexity of parallelism. Developers can build massively parallel data-intensive applications without having to worry about the many different challenges of parallel programming: memory management, threading, queuing, deadlocks, etc.
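As a rough illustration of the dataflow style this kind of engine is built on (again a generic sketch in plain Java, not the DataRush API), operators can be pictured as concurrent stages connected by bounded queues: the framework owns the threads, queuing, and back-pressure, while the application code supplies only the per-record logic. All names below are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical dataflow sketch: source -> transform -> sink,
 *  each operator on its own thread, wired by bounded queues
 *  that provide back-pressure automatically. */
public class DataflowPipeline {
    static final int EOF = -1; // sentinel marking end of stream

    public static long run(int n) throws InterruptedException {
        BlockingQueue<Integer> q1 = new ArrayBlockingQueue<>(64);
        BlockingQueue<Integer> q2 = new ArrayBlockingQueue<>(64);

        // Source operator: emits the values 1..n.
        Thread source = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) q1.put(i);
                q1.put(EOF);
            } catch (InterruptedException ignored) { }
        });

        // Transform operator: squares each value as it flows through.
        Thread square = new Thread(() -> {
            try {
                int v;
                while ((v = q1.take()) != EOF) q2.put(v * v);
                q2.put(EOF);
            } catch (InterruptedException ignored) { }
        });

        source.start();
        square.start();

        // Sink operator: sums the squares on the caller's thread.
        long sum = 0;
        int v;
        while ((v = q2.take()) != EOF) sum += v;
        source.join();
        square.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(10)); // 1^2 + 2^2 + ... + 10^2
    }
}
```

The point of the sketch is what the application developer does not write: no explicit locks, no deadlock-avoidance protocol, no hand-rolled scheduling; the bounded queues throttle a fast producer against a slow consumer for free.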
Scaling for economics: The relentless and astonishing competitive pricing power of hardware commoditization continues to produce price declines that will change data-intensive HPC architecture forever. The focus of this change is the multicore revolution, as seen in larger core count servers, but other contributors include amazing expansion of commodity disk capacity per dollar of cost, and increasingly available high speed connectivity.
Given decades of refinement of today’s conventional data-intensive HPC approaches, it will be a challenge to migrate away from current hardware and software “stacks” that include:
· hard to maintain/manage clusters of servers, which by their sheer number cause:
· a cascading avalanche of software licenses, driving up:
· management and administration cost to deal with distributed resources, and
· non “Green” technology, as they hit the power/cooling “wall”
Anyone can do the math: combine the incredible power and parallelism of commodity SMP boxes with a next-generation, multicore-friendly, fine-grained massively parallel engine like DataRush, and the price/performance of the resulting system surpasses that of traditional cluster-based solutions. When you add the “green” advantage these multicore SMP boxes enjoy, the case for quickly exploring and exploiting this “economic scaling” is compelling.
Given the cost curve of the alternative hardware, current cluster-based designs offer no comparable opportunity for economic scaling.
Each of these ideas on its own is very relevant. Combined they paint a game-changing transition in DI-HPC. We believe the very essence of progress in science depends on the ability of scientists to experiment and iterate frequently and rapidly. The current commercial computing landscape is facing exactly the same requirement: both segments need to crunch massive quantities of data in rapidly shrinking time windows.
How can others get involved, learn more, and take advantage of your efforts?
We are committed to strongly supporting academic and commercial research and software efforts, and we make the Pervasive DataRush engine and library available for academic use and trial download for development purposes at no charge. For more information, please visit our website at www.pervasivedatarush.com and download our framework.