Processors continue to become more "core dense" as we approach a watershed mark in multicore development. We've reached the point where a 4-processor system takes on the characteristics of a "cluster in a box". Mike Hoskins, our CTO at Pervasive, has deemed March 2010 the "Multicore Month of March", a clever alliteration alluding to the latest processor announcements from Intel and AMD. Intel is launching its Nehalem EX with 8 cores and AMD its Magny Cours processors with up to 12 cores. This AMD blog outlines a contest to put an AMD 48-core system to good use; the winner will receive a free 48-core box.
But what to do with all those cores? At Pervasive, we've been developing a platform called DataRush that helps developers of all skill levels write high-performing, scalable Java code. Based on a dataflow architecture, the underlying framework has a very simple programming paradigm. Because it uses a shared-nothing approach, the developer is freed from having to manage hundreds of threads, worry about synchronization and deadlock, or deal with other parallel programming headaches.
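To make the dataflow idea concrete, here is a minimal sketch in plain Java. It does not use the DataRush API; the class name DataflowSketch, the three stages, and the end-of-stream marker are all assumptions made for illustration. Each stage owns its own state and talks to its neighbors only through queues, which is the essence of the shared-nothing style described above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative only: a hand-rolled three-stage pipeline, NOT the DataRush API.
public class DataflowSketch {
    // Hypothetical end-of-stream marker passed down the pipeline.
    private static final long END = Long.MIN_VALUE;

    public static long sumOfSquares(int n) {
        BlockingQueue<Long> numbers = new ArrayBlockingQueue<>(1024);
        BlockingQueue<Long> squares = new ArrayBlockingQueue<>(1024);
        long[] result = new long[1];

        // Stage 1: emit the values 1..n, then signal end of stream.
        Thread source = new Thread(() -> {
            try {
                for (long i = 1; i <= n; i++) numbers.put(i);
                numbers.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: square each value; touches no state outside its queues.
        Thread square = new Thread(() -> {
            try {
                for (long v = numbers.take(); v != END; v = numbers.take()) squares.put(v * v);
                squares.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 3: accumulate the squares into a private running total.
        Thread sink = new Thread(() -> {
            try {
                long total = 0;
                for (long v = squares.take(); v != END; v = squares.take()) total += v;
                result[0] = total;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        source.start(); square.start(); sink.start();
        try {
            source.join(); square.join(); sink.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return result[0]; // safe to read: join() orders it after the sink's write
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10));
    }
}
```

Even in this toy version there are threads, queues, and shutdown signaling to get right by hand. That plumbing is the kind of work a dataflow framework is meant to take off the developer's plate.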
We recently came across a set of cluster-based benchmarks called MalStone. The MalStone Benchmark was developed by the Open Cloud Consortium (www.opencloudconsortium.com). For additional information, see www.opencloudconsortium.org/benchmarks. Even though these benchmarks are targeted at clusters, we decided to build a DataRush application for the MalStone B benchmark and run it on a 24-core system. The results were astounding!
The MalStone B benchmark consists of a 10 billion row log file containing site-entity records. The file approaches 1 Terabyte in size. The purpose of the benchmark is to compute, for each site (w) and each week (d), a ratio over all entities that visited the site: the percentage of visits for which the entity became compromised at any time between the visit and the end of week d. This type of processing is typical for cyber security applications.
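The ratio described above can be sketched in a few lines of plain Java. This is a simplified illustration and not our DataRush implementation. The Visit record layout, the week-level granularity, and the names MalStoneBSketch and ratio are assumptions made for the example, and the sketch counts visits made up to and including week d, which is one reading of the definition.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative only: an in-memory, single-threaded reading of the MalStone B ratio.
public class MalStoneBSketch {
    /** Hypothetical simplified log record; the real benchmark format may differ. */
    public static final class Visit {
        public final String site;
        public final String entity;
        public final int week;            // week number of the visit
        public final boolean compromised; // true if the entity was flagged at this event

        public Visit(String site, String entity, int week, boolean compromised) {
            this.site = site;
            this.entity = entity;
            this.week = week;
            this.compromised = compromised;
        }
    }

    /** Fraction of visits to `site` through week `d` whose entity was compromised by the end of week `d`. */
    public static double ratio(List<Visit> log, String site, int d) {
        // Pass 1: collect the weeks in which each entity was flagged as compromised.
        Map<String, TreeSet<Integer>> compromiseWeeks = new HashMap<>();
        for (Visit v : log) {
            if (v.compromised) {
                compromiseWeeks.computeIfAbsent(v.entity, k -> new TreeSet<>()).add(v.week);
            }
        }

        // Pass 2: count visits to the site and those followed by a compromise no later than week d.
        long visits = 0;
        long marked = 0;
        for (Visit v : log) {
            if (!v.site.equals(site) || v.week > d) continue;
            visits++;
            TreeSet<Integer> weeks = compromiseWeeks.get(v.entity);
            // Earliest compromise at or after the visit week (week granularity is a simplification).
            Integer next = (weeks == null) ? null : weeks.ceiling(v.week);
            if (next != null && next <= d) marked++;
        }
        return visits == 0 ? 0.0 : (double) marked / visits;
    }
}
```

At 10 billion rows the interesting part is not the arithmetic but doing the grouping and joining at scale. This in-memory version only pins down what is being computed.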
For our testing at Pervasive, we used a single machine consisting of 4 AMD 8435 processors running at 2.6 GHz. This gives the machine a total of 24 cores. The total memory capacity is 64 Gigabytes. The I/O system uses a RAID filesystem consisting of 5 SATA drives. This machine does have a large memory capacity; however, during the test we generally used only about 9 Gigabytes for the JVM.
Using this system we were able to achieve a 68-minute runtime for the MalStone B benchmark. That's approaching a Terabyte an hour of processing on a single-node, inexpensive server-class machine. Contrast this runtime with the same test run on a 20-node cluster of 4-core boxes using Hadoop. The MalStone B runtime for the Hadoop cluster is 840 minutes. DataRush on a single-node machine is about 12 times faster than the small cluster! To be fair, the processors in the cluster nodes are not exactly comparable to the ones in the DataRush test box. This test still shows that an inexpensive, single server-class machine running software like DataRush, which can take advantage of multicore processors, can outperform distributed processing. Cluster in a box in action!
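For readers who want to check the arithmetic, the figures above can be worked through with a small helper. The class and method names are made up for this example. The 80-core figure for the cluster is simply 20 nodes times 4 cores, and the core-minute comparison ignores the processor differences already noted, so treat it as a rough indication only.

```java
// Illustrative only: back-of-the-envelope arithmetic on the numbers quoted in this post.
public class ThroughputSketch {
    /** How many times faster `minutes` is than `baselineMinutes`. */
    public static double speedup(double baselineMinutes, double minutes) {
        return baselineMinutes / minutes;
    }

    /** Average rows processed per second over the whole run. */
    public static double rowsPerSecond(long rows, double minutes) {
        return rows / (minutes * 60.0);
    }

    /** Total core-minutes consumed: a crude measure of hardware effort. */
    public static double coreMinutes(int cores, double minutes) {
        return cores * minutes;
    }

    public static void main(String[] args) {
        double dataRushMinutes = 68.0;  // single 24-core box
        double hadoopMinutes = 840.0;   // 20 nodes x 4 cores = 80 cores
        long rows = 10_000_000_000L;    // 10 billion log records

        System.out.printf("Speedup: %.2fx%n", speedup(hadoopMinutes, dataRushMinutes));
        System.out.printf("Rows per second: %.0f%n", rowsPerSecond(rows, dataRushMinutes));
        System.out.printf("Core-minute ratio: %.1f%n",
                coreMinutes(80, hadoopMinutes) / coreMinutes(24, dataRushMinutes));
    }
}
```

The speedup works out to roughly 12.35, the sustained rate to about 2.45 million rows per second, and the cluster spends roughly 41 times as many core-minutes as the single box.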
This is very exciting when put in the context of the next generation of multicore processors due out in the "Multicore Month of March". We look forward to running this and other benchmarks on these new systems. Check back soon for updated numbers, as I have a sneaking suspicion that we'll easily break through the Terabyte-an-hour threshold with the MalStone benchmark using the latest Intel and AMD systems.