Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.
  • Pervasive DataRush: Cost-effective security for companies in a challenging economic climate

    Security spending in a downturn is under tight scrutiny. PricewaterhouseCoopers found this to be the case when it surveyed 7,200 executives in over 130 countries for its 2010 “Trial by Fire” report. One of the report’s primary findings states: 

    Not surprisingly, security spending is under pressure. Most executives are eyeing strategies to cancel, defer or downsize security-related initiatives.
    Source: PricewaterhouseCoopers 2010 “Trial by Fire” report.

    PricewatershouseCoopers notes that 70% of survey respondents think it is important to consider canceling, deferring or downsizing security-related initiatives if they require capital expenditures while 71% respond similarly for initiatives requiring working expenditures.At the same time, survey respondents overwhelmingly said they considered security strategies to be important, including increasing the focus on data protection, prioritizing security investments on risk, reducing or mitigating major risks and accelerating the adoption of security-related automation technologies to increase efficiencies and reduce cost. With many analysts predicting a slow recovery, the pressure to cut or curb security spending could increase. Meanwhile, security intrusions continue, likely at a more sophisticated scale (consider “Ghostnet” for example).

    So, are there any cost-effective solutions available to meet growing security? Yes.The massively parallel-processing horsepower of the Pervasive DataRush™ data processing engine combined with our matching capabilities forms a powerful solution for translating massive amounts of raw data into actionable intelligence. Combined with Pervasive DataMatcher™, a cost-effective robust, innovative solution is available to financial, insurance, healthcare, law enforcement and homeland security organizations that want to leverage powerful, next-generation analytics for detecting fraud and corruption, complying with anti-money laundering controls, and security and compliance monitoring.

    The Proof is in the Pudding: MalStone B-10 Benchmark

    This year Pervasive DataRush conducted internal testing using the MalStone B. The benchmark examines large volumes of logfiles to look for anomalies that might signal intrusions or attempted intrusions.  

    The data file for MalStoneB is generated by a Python script and the MalStone records have the format:

    Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

    These describe a visit by an entity to a site at a particular time. After the visit, the entity sometimes becomes compromised, which is indicated by setting the compromise flag to 1. Each record is 100 bytes.  MalStoneB computes a ratio for each week d, and computes for each site w, and for all entities that visited the site at week d or earlier, the percent of visits for which the entity became compromised at any time between the visit and the end of the week d.The Malstone benchmark can use a variable sized dataset, In the experiment a 10 billion row dataset totaling 1 terabyte was used.

    Summary of Details:

    • 32 core, 4 socket, 2.0 Ghz Intel Xeon X7550
    • 1890 seconds (32.5 minutes)
    • 5.29 million rows/sec, approx 509 Mbytes/sec

    The Result: In just 31.5 minutes, 10 billion records were searched for anomalies using Pervasive DataRush. This result is 26x faster than its competition.

    Pervasive DataRush can enable applications that scale seamlessly on a single multicore server (rather than a cluster) to prepare or analyze even massive datasets – at unprecedented speeds.  Cost-effective approaches to daunting security challenges – that’s what IT executives seek in the midst of a lingering downturn, and that’s what we want to give them. 

  • Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomical Datasets.


    The process of stellar discovery has long made its home at High Performance Computing (HPC) systems.
    HPC systems have evolved into clusters of "fat" multicore nodes. Applications must take advantage
    of parallelism across nodes and at the node level to maximize scalability and performance/watt.
    The complexity of multicore programming underscores the need for powerful and efficient runtime
    systems that manage resources such as cores, threads, memory, and communication sub-systems on behalf
    of the application. Dataflow is the computational model in Pervasive Datarush to construct
    efficient data-parallel pipelines via threads while abstracting the complexity of multicore programming.

    Simulation of the dynamical evolution of the entire observable universe via N-Body interactions
    begins with the presumed first principles of the universe: cosmic background radiation; an expanding
    volume of cooling helium and hydrogen; dark matter separating from gas and coalescing into massive
    stars. The classical N-body problem simulates the evolution of a system of N bodies, where the force
    exerted on each body arises due to its interaction with all the other bodies in the system. N-body
    algorithms have numerous applications in areas such as astrophysics, molecular dynamics and plasma
    physics. The Cube3PM method for carrying out large N-Body simulations to study formation and evolution
    of the large scale structure in the universe combines direct particle-particle forces at small scales
    with particle-mesh ones at larger scales (Particle-Particle-Particle-Mesh Method). Such an approach
    produces datasets with 4000^3-5488^3 (64-165 billion) particles. Several such simulations were
    completed on Ranger Cluster (Texas Advanced Computing Center) on 4,000-22,976 cores.

    From the astrophysicist’s perspective, the problem of identifying regions of interest in this terascale
    data and being able to visualize these regions is made intractable by the overwhelming volumes of data.
    Current methods to detect individual halos fall into two basic categories, namely friends-of-friends (FOF)
    and spherical over density (SO) methods. The FOF method is particle-based. Dense regions are identified
    by locating particles that are closer to each other than a pre-defined distance, which is a parameter of
    the model and is usually referred to as 'linking length' . Particles that are within that distance from
    each other are called 'friends', and the halos produced consist of all particles which are connected by
    a chain of friends. The SO class of methods, on the other hand, start by identifying the local density
    peaks (or gravitational potential minima) as the halo centers and then expand spherical shells around
    those centers until a pre-defined density threshold (a free parameter of the model picked based on
    dynamical considerations) is crossed. Within these types of methods there are multiple variations,
    regarding e.g. how the halo centers are located, how the gravitationally-unbound particles are treated,
    etc. Each of the two basic approaches, FOF and SO, has its advantages and drawbacks and can fail in
    certain situations (Tinker et al.).

    "Automated methods for halo identification and visualization are critical to advancing the physical
    understanding of what is happening through better analysis", said Astronomy Centre at the
    University of Sussex, UK.

    Our dataflow methods supply an alternative to the current approaches which on one hand is
    density-based like the SO, but does not make assumptions about the halo shapes as the SO does.

    This dataflow implementation distributes itself across multiple nodes on the Longhorn cluster.
    Likewise, this parallelized dataflow AutoHDS facilitates the use of large number of cores on a
    single cheap machine instead of expensive super computers. Experiments revealed that when data
    points were uniformly distributed across partitions, dataflow AutoHDS achieved linear speed up
    with the increase in the number of machines used. Dataflow AutoHDS also yields better performance
    with increasing data volumes.  In comparisons against Hadoop AutoHDS, dataflow was consistently
    faster on fewer resources.

    This work has been submitted for publication.  Coming here soon....

  • Need to Boost Your Analytics?

    You probably understand that Pervasive DataRush is a fast processing engine and is great for data preparation, but did you know that the just released version 4.4 also includes an analytics platform?  Pervasive DataRush for Analytics now allows users to analyze their entire gigantic data set (instead of just a sample) and make use of the newly-available core analytics library to perform data mining operations leveraging the high throughput and scalability of the Pervasive DataRush engine. 

    The new DataRush Core Analytics Library includes several classifiers including Naïve-Bayes, KNN, and decision trees; clustering using a k-means algorithm; association rule mining (ARM); linear regression, logistic regression, multiple regression and polynomial regression; and PMML model support.    

    Pervasive DataRush for Analytics helps organizations dramatically reduce the time necessary to do complex analysis of big data on a single multicore server.  A recent Pervasive DataRush case study shows how the ability to read production parameters and adjust production optimization models quickly with DataRush for Analytics in seconds versus minutes or hours positively affected their bottom line.  Now organizations have the ability to process complex risk calculations faster than before with higher accuracy, which is a huge advantage for timely decision support.  

    The enhanced 4.4 release also includes expanded data preparation capabilities, a JAVA SDK to extend and customize data preparation and analytics capabilities, and a JavaScript-based scripting interface to help you take control of your ever-growing data.    


     

  • Hadoop as a stepping stone ...

    Gordon Brown posted a blog on Redfin commenting on a presentation by Jeff Hammerbacher discussing the use of Hadoop at Facebook.Having solved many big data problems at Facebook, Jeff has great credibility in this area. Jeff discusses how Hadoop was used to replace a massive, centralized data warehouse that was pushed to its limits. Hadoop definitely saved the day at Facebook. With Hadoop, Facebook as been able to process the huge amounts of data required to continue adding needed features to the site.

    Jeff makes several interesting points about Hadoop. I think one of the more interesting is that he views Hadoop as a stepping stone to what's really needed: an extremely flexible approach to large data problem solving. He also points out that once developers moved away from the constraint heavy environment of SQL into the more open environment of Hadoop, their creativity went up, way up. Remove unnecessary constraints and innovation follows.

    Jeff mentioned dataflow as an architecture made for big data processing. Working on Pervasive DataRush, I couldn't agree more. DataRush is based on a dataflow architecture. This enables building big data applications that can handle the volumes in a scalable manner. Not only that, the programming model is very flexible and easy to learn. To Jeff's point, DataRush can remove some of the constraints and inflexibility of a limited Hadoop programming model. As always innovation will follow.

    But can DataRush handle the volumes of Facebook? Perhaps not at the highest levels of Hadoop, at least not today. But, we have shown incredible throughput numbers of 2 Terabytes per hour on the Malstone B10 benchmark. And that is on a single machine, based on Intel 7500 processors. This 32-core box outperformed a 20-node (80-core) cluster by a factor of 26. As compute density increases dramatically, scaling up to take advantage of many-core systems is imperative and pushes out the need to invest in a cluster for handling big data. Scaling up is where DataRush started and continues to excels today.

  • Is Intel On To Something?

    If you ask anyone on the DataRush team, they could tell you more than you wanted to know about how we are an innovator in massively parallel processing.  They could tell you about use cases and fast runtimes and how DataRush is the bees knees for data-intensive applications.

    But what else would you expect to hear?  Instead of telling you how great DataRush really is, we went to Intel, a technology leader, and asked to demonstrate the speed of our parallel processing engine on their hardware.  Next thing we know, we’re testing on a brand new Intel Xeon processor 7500 series and the results made our team extremely enthusiastic.  In fact, our results were superior to those of clusters:  nearly 2 terabytes an hour! 

    Intel must have like it because they just released a success story about our accomplishment.
     

     

    Not that we’re bragging, but here’s some of what Intel had to say:

    Better performance on simpler infrastructure.  The MalStone benchmarks were developed to measure performance on cloud infrastructures, and these single-server results are competitive with or superior to those of compute clusters.

    Scalability to many cores. As shown in the chart, “MalStone B Benchmark Scalability with Pervasive DataRush,” the solution scales well as the number of execution cores increase. 

    Simpler implementation.  The Pervasive DataRush solution architecture is designed explicitly to be applicable to virtually any type of application, abstracting multicore scalability away from the rest of the program logic. 

     

    Together, Intel and Pervasive DataRush, are tackling massive data challenges on a small footprint.  Processing two terabytes an hour opens up a door to endless possibilities that is being recognized by technology leaders, developers, and end users everywhere. 

    Don’t take our word for it – but maybe Intel is on to something…

     

  • What if genetic tests could be done in minutes instead of days?

    For many medical tests, time is an inherent part of the equation.  In most cases, medical results are delayed by compute time.  A good example of this is genetic testing.  Research finds that a patient’s genetics are an important factor in the efficacy of treatment.  Two examples announced recently illustrate this.

    Plavix, the second biggest selling drug after cholesterol-lowering Lipitor, is intended to prevent blood clots that can cause heart attacks and strokes in patients with advanced cardiovascular disease. It's also commonly prescribed to patients treated with devices called stents to prop open diseased coronary arteries. Some 2.5 million to 3 million Plavix prescriptions are written in the U.S. every month. But a genetic variation in a significant minority of patients can prevent the drug from working, or can limit its effectiveness, increasing a patient's risk for a potentially life-threatening heart attack.  Unfortunately a patient typically waits at least two days, and often longer, for the genetic results to come back to determine if Plavix is right for them.

    An even more widely used example is determining which of two popular dieting regimes will be more effective.  Many dieters choose a low-fat approach; others choose a low-carb diet.  What if individuals could easily test for their genetic makeup to determine the best diet for themselves to obtain fast weight loss?  Presently, testing for this is time-consuming and expensive. 

    But what if patients could see results in minutes instead of waiting weeks on all types of medical testing?  Leveraging all the available power of today's multicore hardware at the same time can result in dramatically faster speed.  Pervasive's DataRush platform attacks analytical applications with fully parallel processing and result in significantly shorter wait times.

     Faster results equals better health, deeper understanding, and improved outcomes.  Why wait? 

  • Will Blog For Data

    Working for Pervasive Software gives me insight into some of the large data problems that exist.  I recently read this blog, “Big Data, Big Problems” from Gizmodo, that talks about the explosive collection of data that is collected on a daily basis from everywhere.  It was astonishing to read that Walmart’s transaction databases are 2.5 PETRABYTES!  Is Walmart really analyzing all that data for insight?  As you know, Analytics helps improve business performance in the future based on past historical patterns.  Therefore, analyzing a whole data set would provide greater accuracy in prediction than analyzing just a sample, which might explain Walmart's success.     

     

    But what if you’re not #1 on the Fortune 500 list?  I want to bring the argument closer to home.  I’d love to know what data problems YOU are experiencing.  Leave a comment and let us know what data dilemma you’re trying to solve. 

     

    Are you waiting hours or days to process data for decision support?  Are you only analyzing a sub set of your data because your data set is too large?  Are long iterations slowing down your discovery process or scientific modeling?  Is saving minutes when analyzing data crucial to your bottom line?  Would the ability to perform multiple calculations at the same time be a benefit? 

     

    It would be interesting to see if we are all tackling different problems or if we share a core set of similar challenges.  So tell me, what keeps you up at night?

  • We’ve Crashed Through The 1 Terabyte Wall!

    And we’re not done yet—Pervasive DataRush is now approaching 2 Terabytes/Hour on a single server!!

    Pervasive® DataRush™ has again proved that you can rapidly process terabytes of data on a single server. Recent tests on an Intel Xeon 7500 processor-based server, Pervasive DataRush processed 1 Terabyte of data in just over 30 minutes—specifically, 10 billion rows and 5 columns of data!  Pervasive DataRush ran on Java 7 and multicore hardware tapping 32 cores and 64 threads.

    Our ongoing testing further proves that Pervasive DataRush has the scalability and speed needed to handle significant volumes of data (and more). The testing results offer compelling possibilities in the world of big-data analytics for medicine, science, government and business. Imagine… 

    • If researchers could rapidly determine optimal local alignment of genomic sequencing.
    • If doctors knew patient results in minutes.
    • If the government could identify cyber-security threats in near-real time
    • If business owners could run a query on their entire enterprise data set in minutes and optimize decision-making.
    • If scientists could run multiple iterations on complex models in minutes to discover cures faster.

    And stay tuned:  Pervasive DataRush engineers confirm Pervasive DataRush is now approaching 2 terabytes per hour on a single server.  Further details on the testing will be posted soon!

     

  • Cores, Cores Everywhere But Not a Thread to Switch

    As Intel and Advanced Micro Devices battle it out for processor dominance, we as consumers are finding faster and faster computers for increasingly lower prices.  Quad core systems are now common place.  Recent announcements from Intel on their new Xeon 8 core processors and AMD's 12 core Magny-Cour continue to send processor counts on an upward spiral.  This means 16 to 24 available processing cores on a standard two socket system!  Not so long ago that was considered a large scale server, but now it's a mainstream server.

    Great, there is all this processing power in a small 1u system what can you do with that?  Unfortunately software hasn't kept up with the explosion of cores.  Many IT professionals are finding that is' easier to move to virtualization than to fully utilize a large system.  That has several advantages of course, consolidation of systems, lower costs in power and cooling but there clearly has to be a better way to utilize all those cores.  As core counts increase and two socket systems ship with 32 to 48 cores there is only so much virtualization that you can do.  

    There are software solutions that allow developers to best utilize these cores.  Most programming languages have thread constructs that help the developer write multi-threaded applications.  As simple devices those constructs are filled with potential pitfalls even for experience developers, with deadlocks, race conditions, and other synchronization issues that hurt performance and limit scalability.  Effort into finding ways to work around that by using higher level constructs or programming paradigms are in the works and have been for some time.  In some cases whole languages have been written to help here.  Some newer ones include Google's Go, Map/Reduce, Scala and many others from decades of research.

    Pervasive has a software platform that might help, an easy to use programming paradigm based on Java.  Pervasive's DataRush uses a DataFlow programming model that allows the user to concentrate on the application rather than the synchronization issues.  A DataRush application can automatically scale from a 4 core development system to a 32 core deployment server without having to rewrite the application.  Ease of programming, deployment and all with standard Java!  What do others think?  Are there any solutions out there that can rival DataRush in terms of performance and ease of programming?  Let me know.

  • Fraud Detection and “Finding a needle in a haystack”

     

    Fraud Detection and “Finding a needle in a haystack”

    Benford’s law has been promoted as providing auditors with an automated tool that is simple and
    effective for fraud detection. 

    The law of anomalous numbers was published in 1938 by Frank Benford [1].  Benford noticed that if
    a number is chosen at random from a large data set, the probability of the first digit of such number
    is a 1 is about 30.1%, the probability that the first digit is a 2 is about 17.6%, and in general the
    probability that the first digit  is d is log10(1 + 1/d).  This distribution was named after Frank Benford
    as Benford’s Law, thou Canadian-American astronomer
    Simon Newcomb [2] discovered this statistical
    principle back in 1881.
    The principle behind this curiosity in numbers is the distance between numbers in a logarithmic scale.


    Figure 1 – Logarithmic Scaled Paper.

    log10(2) - log10(1)  = distance between log of 1 and log of 2 is 0.3010. 

    log10(3) – log10(2) = 0.1761

    log10(4) – log10(3) = 0.1250  and so on…

    The expected distribution represents the distance between digits in log scale as depicted in the
    logarithmic scale, see Figure 2.

     
    Figure 2 – Benford’s Law of First Digits Distributions and corresponding Logarithmic Scale.

    Benford’s Law is the DNA-equivalent technique for financial analysis of financial datasets. Any data
    that has been artificially generated will not follow this 1st digit distribution. As an example in Figure 3,
    I have taken the first two columns in the census 2000 SF3 dataset [4].  I then used a random number
    generator to add a third column which I called “incomeTotal”.  The black line below shows the
    expected Benford’s Distribution of first digits.  The blue and green lines correspond to “housingTotal”
    and “ownTotal” columns, first digit distributions.  Both of these columns follow Benford’s Law
    represented by the black line.


    Figure 3 – Benfords Law of First Digits on Census 200 Dataset Example.

    Conversely, Benford’s Law analysis clearly indicates that a significant portion of “incomeTotal”
    contains artificial numbers (See red line, Figure 3) . The desired result from such analysis is a list of
    attributes or columns and the particular row(s) where the first digit counts deviate from the Benfords
    distribution.  In this figure, column labeled “incomeTotal” and all its rows whose first digit begins with
    digits 3 or 4 would be listed as possible fraudulent transactions. One disadvantage of this approach is
    that a sufficient sample size of data is required before a meaningful distribution can be determined. 
    As a first line of defense for automated artificial numbers filter, Benford’s law of first, second, third
    and fourth digits is a widely accepted practice by auditors. 

    A dataflow implementation of Benfords’ Law first partitions a large scale dataset by rows (horizontal
    partitioning).  Subsets of rows are read, parsed and extracted in parallel using the DataRush
    PartitionedDelimitedTextReader operator.  Each data row becomes a data token. Each data token
    contains all fields, column values; for a single row. Multiple data tokens or rows are stored in each
    partition as a DataRush RecordFlow. Data tokens flow downstream in the application graph (See
    Figure 4).  As each data token arrives to the DigitCount operator, each column’s first digit counts are
    binned (1 thru 9 bins). Binning and counting happens in parallel for each partition.  Figure 4 shows the
    dataflow application graph for the DataRush implementation for scalable Benfords’ Law. Each dataflow
    pipeline computes the first digit counts concurrently.  Then the dataflows are reduced to the final
    probability distributions of first digits by aggregating the counts from each partition and normalizing
    by the total counts.


    Figure 4 – DataRush Benford’s Law Application Graph.

    Figure 5 shows scalability of the parallel implementation across 1 thru 8 cores.


    Figure 5 – Scalable Benford’s Law operator across 1 through 8 cores.

    The red line in Figure 5, represents expected runtimes based on linear scalability across increasing
    number of cores.  The blue line shows the preliminary results of the Benford’s Law DataRush Fraud
    operator on the census 2000 dataset. Typically application runtimes using a single core are expected
    to be reduced by one-half when increasing the number of cores from 1 to two.  Likewise, single core
    application runtimes are expected to be reduced by one-fourth when increasing the number of cores
    from 1 to 4.  DataRush applications automatically self tune at runtime based on available resources
    to scale with number of cores.


    Figure 6 – Benford’s Law Application Java Code.

    DataRush framework facilitates the implementation of self-tuning, scalable parallel applications like
    this Fraud Detection example.  The Fraud Detection application graph took approximately 6 hrs to
    compose.  This typical implementation involved the use of existing DataRush I/O operators as wells as
    the implementation of new “BenfordBinning” and “ReduceResults” Operators.  Figure 6 shows the
    BenfordsLaw Class implementation.  The “VectorReader” operator calls DataRush PartitionedTextDelimitedReader operator to read and create “partitionCount” partitions of a dataset
    based on current available cores.  The “For” loop within the “BIN” section of the code adds one
    BenfordBinning operator per partition and names the worker thread “DigitCount”.  For example: at
    any time on a 24-core computer, 24 “DigitCount” threads may be running concurrently.  Each thread
    is running one instance of the “BenfordBinning” operator for a total of 24 concurrent binning calculations. 
    In the “AGGREGATE” section of the code “RoundRobinUnpartitioner”; a part of DataRush core operators
    library, is called to un-partition the dataflows in preparation for the final step “ReduceResults”. In this
    final step, the counts for each digit and each column from the multiple partitions are summarized into
    a global count for each of the nine bins (1-9).

    The Benford’s Law of anomalous numbers has been broadly used in budget, income tax, financial and
    forensic accounting analysis. This blog uses this well known and broadly used algorithm for Fraud
    Detection as an example to demonstrate the benefits of the DataRush platform.  This application (see
    Figure 6) is comprised of 71 new lines of code that fully utilize multicore computers. The DataRush
    platform addresses gaps in design time cost, programming, parallelism, scalability and
    performance/watt, enabling rapid prototyping of data and computational intensive applications.

     

    References

    [1]          Benford, F. "The Law of Anomalous Numbers." Proc. Amer. Phil. Soc. 78, 551-572, 1938.

    [2]          Newcomb, S., http://en.wikipedia.org/wiki/Simon_Newcomb#Benford.27s_law

    [3]          Nigrini, M.J., “Taxpayer compliance application of Benford’s law.” Journal of the American
    Taxation Association, 18(1):72-92, 1996

    [4]          Dataset: US Census 2000 Summary File 3 (SF3)
    REF: http://www.census.gov/Press-Release/www/2002/sumfile3.html

     

     

  • What would I do with 48 cores?

    I mentioned in an earlier blog, the contest sponsored by AMD around the release of their 12-core 6100 series processors. With a 4p system that gives you 48 cores on a single box! This is an amazing amount of compute power in a very compact form. This latest line of processors from AMD includes 4 channels of DDR-3 memory and twice the memory throughput of their current products. This is an extremely important part of the AMD architecture as memory access can dominate today's memory hungry applications.

    So what would I do with a 48-core system? Working in the DataRush group at Pervasive, the first thing I would do is to benchmark DataRush on a really huge problem. The largest problem we've been working on lately is the Malstone-B benchmark. The MalStone Benchmark was developed by the Open Cloud Consortium (www.opencloudconsortium.com). For additional information, see www.opencloudconsortium.org/benchmarks. We've been running the Terabyte sized problem on a single-node box in well under an hour and would like to break that down even further.

    Running a benchmark may not sound historic or world changing. It could be said that it's not such an exciting thing to do with breakthrough technology, but the basis of this benchmark is cyber security: trolling through mountains of data looking for security breaches in networks. Put into perspective, computer networks form the backbone of our national defense structure. It is highly imperative to know the who/when/where and how of any intrusions on these networks. They are constantly under attack by hobbyist hackers, but also by professionals looking to compromise our national security.

    The amount of profilable data generated by these networks is huge and a processor such as the AMD 6100 series provides the needed muscle to power the analysis. A single machine composed of the 12-core 6100 series processors can provide the ability to crunch through multiple Terabytes of data an hour. This increased capacity allows much more data to be processed and also allows for more in-depth analysis. This includes finding patterns of abuse, modeling those patterns and enabling near-real time detection of network intrusions. As new algorithms are developed, the extra horsepower provided by more cores will be desperately needed. All on commercially available hardware, impressive!

    Software such as DataRush is a perfect match for this new line of processors. DataRush can help exploit all 48 cores on this AMD system and will gain a huge boost from the memory throughput improvements. Hardware and software working together in the new multicore world can do amazing things!

     

  • Going really fast ...

    In a recent blog, Robin Bloor discusses how parallelism is required for software to go really fast on todays multicore computers. He brings up this point about MapReduce: using MapReduce on problems it wasn't intended to solve is "... like playing golf with a single club". I'd like to expound a bit on this analysis.

    MapReduce was most famously implemented by Google to fulfill their need to index the world wide web. Quite an undertaking! And MapReduce proved to be critical to their success. The programming model for MapReduce fits perfectly with the problem of finding words within documents and creating indices for later (very fast) lookup.

    However, the MapReduce programming model can be limited when applied to other, more complex problems. Many deep data analysis algorithms require multiple, complex steps to produce their output. In these cases, a more general use programming paradigm is a better and more efficient fit. Hence, Robin's analogy to "... playing golf with a single club".

    Robin goes on to discuss DataRush and the capabilities it brings to bear. Based on a dataflow architecture, the programming model of DataRush is much more flexible and general use than MapReduce. I wouldn't use DataRush to index all the content of the internet, but it has proven to be an excellent tool for general data processing and data mining. And it has the ability to utilize all of the cores available on today's multicore systems. Put into perspective, we have benchmarks showing Terabyte an hour (and even better) processing of network log data for a cyber security application on a single box.

    So does playing golf with only one club mean you can't play golf? Of course not. But it does mean you can't play as well or as efficiently as if you used all the clubs at your disposal. The flexibility of DataRush allows you to utilize a full programming paradigm especially suited to big data problems.

    Robin has a knack for the turn of a phrase. Check him out on Twitter. He also has a very funny (and informative) book out called "Words You Don't Know". I especially like the chapter on swear words. Smile

  • HIMSS 2010 - DataRush in Health IT making a difference in patient care

    The DataRush team just returned from attending the HIMSS 2010 conference in Atlanta (March 1-4th).

     

    HIMSS is traditionally a large conference. As massive as the Atlanta convention center is, this year once again HIMSS filled the venues. Total attendance on March 2nd was recorded at 27,451 with 13,530 as exhibitors. This years’ theme was “Health IT (HIT) making a difference in patient care”. The DataRush team staked its camp at booth#3325 where it visited with presenters, vendors and attendees discussing large data pains and synergies between interoperable health IT and the DataRush parallelism engine.

    Day 1:  Started early with a lunch briefing with a global leader IT services. The architect of the largest healthcare informatics data warehouse shared his experiences with content management and searchable health information. Performance bottlenecks are common in data warehousing (DW) and BI data loading and consumption. Additionally, consolidation of data from up to 40 member companies had its challenges with speed and accuracy of data de-duplication.  In reference to DW loading, Pervasive DataRush offers a platform that leverages multi-core technology enabling unprecedented data throughputs.  In response to data quality and consolidation, Pervasive DataMatcher combines the power of DataRush parallelism engine with sophisticated parametric string matching algorithms to provide the highest precision (fidelity) and recall (completeness) in fuzzy matching.

    Once the exhibitor’s area opened midmorning, DataRush booth featured live demos of Pervasive DataRush Analytics plug-ins for KNIME.  KNIME (Konstanz Information Miner) is an open source data mining tool.  The upcoming DataRush release 5.0 provides KNIME nodes that offer DataRush powered versions of several data mining algorithms. The KNIME interface provides a user friendly workflow-like environment for data miner practitioners to rapidly orchestrate and deploy predictive analysis. Healthcare system integrators delivering EMR and EHR based decision support are turning to predictive analytics and data mining. Pervasive DataRush speeds and accuracy combined with the KNIME interface enables instant processing and response from Tera-scale heterogeneous data repositories.

    Day 2:  Booked back-to-back with briefings to target specialized areas in Healthcare data exchange, interoperability, network support and cyber security. Healthcare data exchange deals with transformations between End-User formats and ANSI EDI, HL7, etc. standards.  These transformations are facilitated via ETL tools.  Pervasive DataRush powered schema-to-schema transformations supports healthcare standards as defined by the American National Standards Institute (ANSI).

    Within cyber-security, Pervasive DataRush presented results from a benchmark study based on Malstone algorithm for web exploits. The Malstone DataRush implementation is capable of processing 1Tb of web log files per hr. Furthermore, DataRush is 12 times faster on 19 less computers when compared to the Malstone algorithm implementation using MapReduce on a 20 node cluster. Customers requiring enhanced cyber security discussed their bottlenecks in vulnerability assessments. An important task in cyber security involves analyzing large volumes of network data against databases of Common Vulnerabilities and Exposures (CVE) software flaws, Common Configuration Enumeration (CCE) mis-configurations, etc. and then scoring them based on National Institute of Standards and Technology (NIST) severity scoring system for vulnerability.  The Common Vulnerability Scoring System (CVSS) provides an open framework for communicating the characteristics and impacts of IT vulnerabilities. DataRush offers data parallelism where subsets of network data can be concurrently read in blocks and processed. The processing here refers to concurrent computation of CVSS core equations for “AccessComplexity”, “Authentication”, “AccessVector”, “ConfidentialityImpact” and “AvailabilityImpact”. The final step in the processing pipeline is to aggregate these scores to produce the CVSS “BaseScore”.

    Regarding network support and interoperability, discussions were centered on DataRush implementations of parallelized encryption and decryption algorithms as well as baseline change detection algorithms to detect anomalies in network log files.  These are novel applications of DataRush driven entirely by customer need.

    Day 3:  Had the most traffic through our booth.  Fraud, waste and abuse in healthcare came up frequently among vendors and attendees.  The Pervasive DataRush team featured live demo of the Benford’s Law Fraud Detection operator recently added to the Pervasive DataRush-Analytics library and the DataRush KNIME plug-ins package. Implementation and application of this operator is explained in detail in my blog titled “Finding a needle in a hay stack”, click here to read.

    Other topics of interest at HIMSS included Image Processing and Image Content Management. Image management solutions today use picture archiving and communication systems (PACS).  PACS is a combination of hardware and software dedicated to the short and long term storage, retrieval, management, distribution, and presentation of images. There are a lot of inefficiencies in image processing pipelines that could benefit from DataRush parallelism. Higher quality, efficient packaging and faster delivery of images results in earlier diagnosis and characterization of disease with the potential to improve patient outcomes and reduce costs.

    To summarize, we have found healthcare IT to be a great fit for DataRush parallelism engine and its analytics operator library. HIMSS was a successful event for Pervasive DataRush Team in the number of leads generated, the networking opportunities and the flow of ideas in the application of DataRush to operations and analytics in healthcare.

     

  • Cluster in a box

    Processors continue to become more "core dense" as we approach a watershed mark in multicore development. We've approached the point where a 4 processor system is taking on the characteristics of a "cluster in a box". Mike Hoskins, our CTO at Pervasive has deemed March 2010 the "Multicore Month of March" - a clever alliteration alluding to the latest processor announcements of Intel and AMD. Intel is launching their Nehalem EX with 8 cores and AMD their Magny Cours processors with up to 12 cores. This AMD blog outlines a contest to put an AMD 48-core system to good use. The winner to receive a free 48-core box.

    But what to do with all those cores? At Pervasive, we've been developing a platform called DataRush that helps developers of all skill levels write high performing, scalable Java code. Based on a dataflow architecture, the underlying framework has a very simple programming paradigm. Using a shared-nothing approach, the developer is freed from having to manage hundreds of threads, worry about synchronization and deadlock or deal with other parallel programming headaches.

    We recently came across a set of cluster-based benchmarks called MalStone. The MalStone Benchmark was developed by the Open Cloud Consortium (www.opencloudconsortium.com). For additional information, see www.opencloudconsortium.org/benchmarks. Even though these benchmarks are targeted at clusters, we decided to build a DataRush application for the MalStone B benchmark and run it on a 24 core system. The results were astounding!

    The MalStone B benchmark consists of a 10 billion row log file containing site-entity records. The file size approaches 1 Terabyte in size. The purpose of the benchmark is to compute a ratio for each site (w), per week (d), for all entities that visited the site, the percent of visits for which the entity became compromised at any time between the fisit and the end of week d. This type of processing is typical for cyber security applications.

    For our testing at Pervasive, we used a single machine consisting of 4 AMD 8435 processors running at 2.6 GHZ. This gives the machine a total of 24 cores. The total memory capacity is 64 Gigabytes. The I/O system uses a RAID filesystem consisting of 5 SATA drives. This machine does have a large memory capacity, however during the test we generally utilize about 9 Gigabytes for the JVM.

    Using this system we were able to achieve a 68 minute runtime for the MalStone B benchmark. That's approaching a Terabyte an hour of processing on a single node, inexpensive server-class machine. Contrast this runtime to the same test run on a 20-node cluster of 4-core boxes using Hadoop. The MalStone B runtime for the Hadoop cluster is 840 minutes. DataRush on a single node machine is 12 times faster than the small cluster! To be fair, the processers in the cluster nodes are not exactly comparable to the ones in the DataRush test box. This test still shows that an inexpensive, single server-class machine with software like DataRush that can take advantage of multicore processors can outperform distributed processing. Cluster in a box in action!

    This is very exciting when put in context of the next generation of multicore processors due out in the "Multicore Month of March". We look forward to running this and other benchmarks on these new systems. Check back soon for updated numbers as I have a sneaking suspicion that we'll easily break through the Terabyte an hour threshold with the MalStone benchmark using the latest Intel and AMD systems.

     

  • PMML validation

    Predictive Model Markup Language (PMML) is the leading standard for statistical and data mining models. PMML describes one or more structures of the data mining models in XML document with a root element of type PMML.

    Our Pervasive DataRush-Analytics project provides the following data mining models: AssociationModel, NaiveBayesModel, and RegressionModel.  The PMML generated from these models can be shared and exchanged from one environment to another, but the PMML needs to be validated against the schema to find any problems that may need to be fixed. 

    To guarantee validation, the Pervasive DataRush-Analytics model uses both XSD validation and XSLT validation as recommended by data mining group.

    First step:  XSD Validation :

    Get the PMML XSD 3.2 schema

    Here is an example of validating PMML file against PMML XSD schema:

    public void pmmlXSDValidate(String schemaPath, String sourcePath) {
    try {
    SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);Source schemaFile = new StreamSource(new File(schemaPath));

    Schema schema = factory.newSchema(schemaFile);

    Validator validator = schema.newValidator();

    validator.validate(new StreamSource(sourcePath));

    } catch (SAXException e) {

    ..........

    } catch (IOException e) {

    ..........

    }

    }

    XSD validation is a necessary part, but not sufficient by itself for determining if a PMML model is valid.

    Second step: XSLT Validation:

    Get the PMML XSLT style sheet

    Here is an example of XSLT validation.

    public void pmmlXSLTvalidate(String stylesheetPath, String sourcePath, String resultPath) {
    try {

    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();

    //This setting will ignore the namespace

    docFactory.setNamespaceAware(false);

    DocumentBuilder parser = docFactory.newDocumentBuilder();

    Document document = parser.parse(
    new FileInputStream(sourcePath));

    Source pmmlSource = new DOMSource(document);

    Source xsltSource = new StreamSource(new FileInputStream(stylesheetPath));

    TransformerFactory transFactory = TransformerFactory.newInstance();

    Transformer transformer = transFactory.newTransformer(xsltSource);

    transformer .transform(pmmlSource , new StreamResult(resultPath));

    //check result after transformation

    ......................

    } catch (TransformerConfigurationException e) {

    ..............

    } catch (TransformerException e) {

    ..............

    }

    }

    It is possible that problems may still exist even if the PMML is validated, but running this test lowers the probability.  Once validated, Pervasive DataRush-Analytics models will provide specified results to help you analyze your business data and predict customer need.

More Posts Next page »