Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

March 2010 - Posts

  • Fraud Detection and “Finding a needle in a haystack”

     

    Fraud Detection and “Finding a needle in a haystack”

    Benford’s law has been promoted as providing auditors with an automated tool that is simple and
    effective for fraud detection. 

    The law of anomalous numbers was published in 1938 by Frank Benford [1].  Benford noticed that if
    a number is chosen at random from a large data set, the probability of the first digit of such number
    is a 1 is about 30.1%, the probability that the first digit is a 2 is about 17.6%, and in general the
    probability that the first digit  is d is log10(1 + 1/d).  This distribution was named after Frank Benford
    as Benford’s Law, thou Canadian-American astronomer
    Simon Newcomb [2] discovered this statistical
    principle back in 1881.
    The principle behind this curiosity in numbers is the distance between numbers in a logarithmic scale.


    Figure 1 – Logarithmic Scaled Paper.

    log10(2) - log10(1)  = distance between log of 1 and log of 2 is 0.3010. 

    log10(3) – log10(2) = 0.1761

    log10(4) – log10(3) = 0.1250  and so on…

    The expected distribution represents the distance between digits in log scale as depicted in the
    logarithmic scale, see Figure 2.

     
    Figure 2 – Benford’s Law of First Digits Distributions and corresponding Logarithmic Scale.

    Benford’s Law is the DNA-equivalent technique for financial analysis of financial datasets. Any data
    that has been artificially generated will not follow this 1st digit distribution. As an example in Figure 3,
    I have taken the first two columns in the census 2000 SF3 dataset [4].  I then used a random number
    generator to add a third column which I called “incomeTotal”.  The black line below shows the
    expected Benford’s Distribution of first digits.  The blue and green lines correspond to “housingTotal”
    and “ownTotal” columns, first digit distributions.  Both of these columns follow Benford’s Law
    represented by the black line.


    Figure 3 – Benfords Law of First Digits on Census 200 Dataset Example.

    Conversely, Benford’s Law analysis clearly indicates that a significant portion of “incomeTotal”
    contains artificial numbers (See red line, Figure 3) . The desired result from such analysis is a list of
    attributes or columns and the particular row(s) where the first digit counts deviate from the Benfords
    distribution.  In this figure, column labeled “incomeTotal” and all its rows whose first digit begins with
    digits 3 or 4 would be listed as possible fraudulent transactions. One disadvantage of this approach is
    that a sufficient sample size of data is required before a meaningful distribution can be determined. 
    As a first line of defense for automated artificial numbers filter, Benford’s law of first, second, third
    and fourth digits is a widely accepted practice by auditors. 

    A dataflow implementation of Benfords’ Law first partitions a large scale dataset by rows (horizontal
    partitioning).  Subsets of rows are read, parsed and extracted in parallel using the DataRush
    PartitionedDelimitedTextReader operator.  Each data row becomes a data token. Each data token
    contains all fields, column values; for a single row. Multiple data tokens or rows are stored in each
    partition as a DataRush RecordFlow. Data tokens flow downstream in the application graph (See
    Figure 4).  As each data token arrives to the DigitCount operator, each column’s first digit counts are
    binned (1 thru 9 bins). Binning and counting happens in parallel for each partition.  Figure 4 shows the
    dataflow application graph for the DataRush implementation for scalable Benfords’ Law. Each dataflow
    pipeline computes the first digit counts concurrently.  Then the dataflows are reduced to the final
    probability distributions of first digits by aggregating the counts from each partition and normalizing
    by the total counts.


    Figure 4 – DataRush Benford’s Law Application Graph.

    Figure 5 shows scalability of the parallel implementation across 1 thru 8 cores.


    Figure 5 – Scalable Benford’s Law operator across 1 through 8 cores.

    The red line in Figure 5, represents expected runtimes based on linear scalability across increasing
    number of cores.  The blue line shows the preliminary results of the Benford’s Law DataRush Fraud
    operator on the census 2000 dataset. Typically application runtimes using a single core are expected
    to be reduced by one-half when increasing the number of cores from 1 to two.  Likewise, single core
    application runtimes are expected to be reduced by one-fourth when increasing the number of cores
    from 1 to 4.  DataRush applications automatically self tune at runtime based on available resources
    to scale with number of cores.


    Figure 6 – Benford’s Law Application Java Code.

    DataRush framework facilitates the implementation of self-tuning, scalable parallel applications like
    this Fraud Detection example.  The Fraud Detection application graph took approximately 6 hrs to
    compose.  This typical implementation involved the use of existing DataRush I/O operators as wells as
    the implementation of new “BenfordBinning” and “ReduceResults” Operators.  Figure 6 shows the
    BenfordsLaw Class implementation.  The “VectorReader” operator calls DataRush PartitionedTextDelimitedReader operator to read and create “partitionCount” partitions of a dataset
    based on current available cores.  The “For” loop within the “BIN” section of the code adds one
    BenfordBinning operator per partition and names the worker thread “DigitCount”.  For example: at
    any time on a 24-core computer, 24 “DigitCount” threads may be running concurrently.  Each thread
    is running one instance of the “BenfordBinning” operator for a total of 24 concurrent binning calculations. 
    In the “AGGREGATE” section of the code “RoundRobinUnpartitioner”; a part of DataRush core operators
    library, is called to un-partition the dataflows in preparation for the final step “ReduceResults”. In this
    final step, the counts for each digit and each column from the multiple partitions are summarized into
    a global count for each of the nine bins (1-9).

    The Benford’s Law of anomalous numbers has been broadly used in budget, income tax, financial and
    forensic accounting analysis. This blog uses this well known and broadly used algorithm for Fraud
    Detection as an example to demonstrate the benefits of the DataRush platform.  This application (see
    Figure 6) is comprised of 71 new lines of code that fully utilize multicore computers. The DataRush
    platform addresses gaps in design time cost, programming, parallelism, scalability and
    performance/watt, enabling rapid prototyping of data and computational intensive applications.

     

    References

    [1]          Benford, F. "The Law of Anomalous Numbers." Proc. Amer. Phil. Soc. 78, 551-572, 1938.

    [2]          Newcomb, S., http://en.wikipedia.org/wiki/Simon_Newcomb#Benford.27s_law

    [3]          Nigrini, M.J., “Taxpayer compliance application of Benford’s law.” Journal of the American
    Taxation Association, 18(1):72-92, 1996

    [4]          Dataset: US Census 2000 Summary File 3 (SF3)
    REF: http://www.census.gov/Press-Release/www/2002/sumfile3.html

     

     

  • What would I do with 48 cores?

    I mentioned in an earlier blog, the contest sponsored by AMD around the release of their 12-core 6100 series processors. With a 4p system that gives you 48 cores on a single box! This is an amazing amount of compute power in a very compact form. This latest line of processors from AMD includes 4 channels of DDR-3 memory and twice the memory throughput of their current products. This is an extremely important part of the AMD architecture as memory access can dominate today's memory hungry applications.

    So what would I do with a 48-core system? Working in the DataRush group at Pervasive, the first thing I would do is to benchmark DataRush on a really huge problem. The largest problem we've been working on lately is the Malstone-B benchmark. The MalStone Benchmark was developed by the Open Cloud Consortium (www.opencloudconsortium.com). For additional information, see www.opencloudconsortium.org/benchmarks. We've been running the Terabyte sized problem on a single-node box in well under an hour and would like to break that down even further.

    Running a benchmark may not sound historic or world changing. It could be said that it's not such an exciting thing to do with breakthrough technology, but the basis of this benchmark is cyber security: trolling through mountains of data looking for security breaches in networks. Put into perspective, computer networks form the backbone of our national defense structure. It is highly imperative to know the who/when/where and how of any intrusions on these networks. They are constantly under attack by hobbyist hackers, but also by professionals looking to compromise our national security.

    The amount of profilable data generated by these networks is huge and a processor such as the AMD 6100 series provides the needed muscle to power the analysis. A single machine composed of the 12-core 6100 series processors can provide the ability to crunch through multiple Terabytes of data an hour. This increased capacity allows much more data to be processed and also allows for more in-depth analysis. This includes finding patterns of abuse, modeling those patterns and enabling near-real time detection of network intrusions. As new algorithms are developed, the extra horsepower provided by more cores will be desperately needed. All on commercially available hardware, impressive!

    Software such as DataRush is a perfect match for this new line of processors. DataRush can help exploit all 48 cores on this AMD system and will gain a huge boost from the memory throughput improvements. Hardware and software working together in the new multicore world can do amazing things!

     

  • Going really fast ...

    In a recent blog, Robin Bloor discusses how parallelism is required for software to go really fast on todays multicore computers. He brings up this point about MapReduce: using MapReduce on problems it wasn't intended to solve is "... like playing golf with a single club". I'd like to expound a bit on this analysis.

    MapReduce was most famously implemented by Google to fulfill their need to index the world wide web. Quite an undertaking! And MapReduce proved to be critical to their success. The programming model for MapReduce fits perfectly with the problem of finding words within documents and creating indices for later (very fast) lookup.

    However, the MapReduce programming model can be limited when applied to other, more complex problems. Many deep data analysis algorithms require multiple, complex steps to produce their output. In these cases, a more general use programming paradigm is a better and more efficient fit. Hence, Robin's analogy to "... playing golf with a single club".

    Robin goes on to discuss DataRush and the capabilities it brings to bear. Based on a dataflow architecture, the programming model of DataRush is much more flexible and general use than MapReduce. I wouldn't use DataRush to index all the content of the internet, but it has proven to be an excellent tool for general data processing and data mining. And it has the ability to utilize all of the cores available on today's multicore systems. Put into perspective, we have benchmarks showing Terabyte an hour (and even better) processing of network log data for a cyber security application on a single box.

    So does playing golf with only one club mean you can't play golf? Of course not. But it does mean you can't play as well or as efficiently as if you used all the clubs at your disposal. The flexibility of DataRush allows you to utilize a full programming paradigm especially suited to big data problems.

    Robin has a knack for the turn of a phrase. Check him out on Twitter. He also has a very funny (and informative) book out called "Words You Don't Know". I especially like the chapter on swear words. Smile

  • HIMSS 2010 - DataRush in Health IT making a difference in patient care

    The DataRush team just returned from attending the HIMSS 2010 conference in Atlanta (March 1-4th).

     

    HIMSS is traditionally a large conference. As massive as the Atlanta convention center is, this year once again HIMSS filled the venues. Total attendance on March 2nd was recorded at 27,451 with 13,530 as exhibitors. This years’ theme was “Health IT (HIT) making a difference in patient care”. The DataRush team staked its camp at booth#3325 where it visited with presenters, vendors and attendees discussing large data pains and synergies between interoperable health IT and the DataRush parallelism engine.

    Day 1:  Started early with a lunch briefing with a global leader IT services. The architect of the largest healthcare informatics data warehouse shared his experiences with content management and searchable health information. Performance bottlenecks are common in data warehousing (DW) and BI data loading and consumption. Additionally, consolidation of data from up to 40 member companies had its challenges with speed and accuracy of data de-duplication.  In reference to DW loading, Pervasive DataRush offers a platform that leverages multi-core technology enabling unprecedented data throughputs.  In response to data quality and consolidation, Pervasive DataMatcher combines the power of DataRush parallelism engine with sophisticated parametric string matching algorithms to provide the highest precision (fidelity) and recall (completeness) in fuzzy matching.

    Once the exhibitor’s area opened midmorning, DataRush booth featured live demos of Pervasive DataRush Analytics plug-ins for KNIME.  KNIME (Konstanz Information Miner) is an open source data mining tool.  The upcoming DataRush release 5.0 provides KNIME nodes that offer DataRush powered versions of several data mining algorithms. The KNIME interface provides a user friendly workflow-like environment for data miner practitioners to rapidly orchestrate and deploy predictive analysis. Healthcare system integrators delivering EMR and EHR based decision support are turning to predictive analytics and data mining. Pervasive DataRush speeds and accuracy combined with the KNIME interface enables instant processing and response from Tera-scale heterogeneous data repositories.

    Day 2:  Booked back-to-back with briefings to target specialized areas in Healthcare data exchange, interoperability, network support and cyber security. Healthcare data exchange deals with transformations between End-User formats and ANSI EDI, HL7, etc. standards.  These transformations are facilitated via ETL tools.  Pervasive DataRush powered schema-to-schema transformations supports healthcare standards as defined by the American National Standards Institute (ANSI).

    Within cyber-security, Pervasive DataRush presented results from a benchmark study based on Malstone algorithm for web exploits. The Malstone DataRush implementation is capable of processing 1Tb of web log files per hr. Furthermore, DataRush is 12 times faster on 19 less computers when compared to the Malstone algorithm implementation using MapReduce on a 20 node cluster. Customers requiring enhanced cyber security discussed their bottlenecks in vulnerability assessments. An important task in cyber security involves analyzing large volumes of network data against databases of Common Vulnerabilities and Exposures (CVE) software flaws, Common Configuration Enumeration (CCE) mis-configurations, etc. and then scoring them based on National Institute of Standards and Technology (NIST) severity scoring system for vulnerability.  The Common Vulnerability Scoring System (CVSS) provides an open framework for communicating the characteristics and impacts of IT vulnerabilities. DataRush offers data parallelism where subsets of network data can be concurrently read in blocks and processed. The processing here refers to concurrent computation of CVSS core equations for “AccessComplexity”, “Authentication”, “AccessVector”, “ConfidentialityImpact” and “AvailabilityImpact”. The final step in the processing pipeline is to aggregate these scores to produce the CVSS “BaseScore”.

    Regarding network support and interoperability, discussions were centered on DataRush implementations of parallelized encryption and decryption algorithms as well as baseline change detection algorithms to detect anomalies in network log files.  These are novel applications of DataRush driven entirely by customer need.

    Day 3:  Had the most traffic through our booth.  Fraud, waste and abuse in healthcare came up frequently among vendors and attendees.  The Pervasive DataRush team featured live demo of the Benford’s Law Fraud Detection operator recently added to the Pervasive DataRush-Analytics library and the DataRush KNIME plug-ins package. Implementation and application of this operator is explained in detail in my blog titled “Finding a needle in a hay stack”, click here to read.

    Other topics of interest at HIMSS included Image Processing and Image Content Management. Image management solutions today use picture archiving and communication systems (PACS).  PACS is a combination of hardware and software dedicated to the short and long term storage, retrieval, management, distribution, and presentation of images. There are a lot of inefficiencies in image processing pipelines that could benefit from DataRush parallelism. Higher quality, efficient packaging and faster delivery of images results in earlier diagnosis and characterization of disease with the potential to improve patient outcomes and reduce costs.

    To summarize, we have found healthcare IT to be a great fit for DataRush parallelism engine and its analytics operator library. HIMSS was a successful event for Pervasive DataRush Team in the number of leads generated, the networking opportunities and the flow of ideas in the application of DataRush to operations and analytics in healthcare.

     

  • Cluster in a box

    Processors continue to become more "core dense" as we approach a watershed mark in multicore development. We've approached the point where a 4 processor system is taking on the characteristics of a "cluster in a box". Mike Hoskins, our CTO at Pervasive has deemed March 2010 the "Multicore Month of March" - a clever alliteration alluding to the latest processor announcements of Intel and AMD. Intel is launching their Nehalem EX with 8 cores and AMD their Magny Cours processors with up to 12 cores. This AMD blog outlines a contest to put an AMD 48-core system to good use. The winner to receive a free 48-core box.

    But what to do with all those cores? At Pervasive, we've been developing a platform called DataRush that helps developers of all skill levels write high performing, scalable Java code. Based on a dataflow architecture, the underlying framework has a very simple programming paradigm. Using a shared-nothing approach, the developer is freed from having to manage hundreds of threads, worry about synchronization and deadlock or deal with other parallel programming headaches.

    We recently came across a set of cluster-based benchmarks called MalStone. The MalStone Benchmark was developed by the Open Cloud Consortium (www.opencloudconsortium.com). For additional information, see www.opencloudconsortium.org/benchmarks. Even though these benchmarks are targeted at clusters, we decided to build a DataRush application for the MalStone B benchmark and run it on a 24 core system. The results were astounding!

    The MalStone B benchmark consists of a 10 billion row log file containing site-entity records. The file size approaches 1 Terabyte in size. The purpose of the benchmark is to compute a ratio for each site (w), per week (d), for all entities that visited the site, the percent of visits for which the entity became compromised at any time between the fisit and the end of week d. This type of processing is typical for cyber security applications.

    For our testing at Pervasive, we used a single machine consisting of 4 AMD 8435 processors running at 2.6 GHZ. This gives the machine a total of 24 cores. The total memory capacity is 64 Gigabytes. The I/O system uses a RAID filesystem consisting of 5 SATA drives. This machine does have a large memory capacity, however during the test we generally utilize about 9 Gigabytes for the JVM.

    Using this system we were able to achieve a 68 minute runtime for the MalStone B benchmark. That's approaching a Terabyte an hour of processing on a single node, inexpensive server-class machine. Contrast this runtime to the same test run on a 20-node cluster of 4-core boxes using Hadoop. The MalStone B runtime for the Hadoop cluster is 840 minutes. DataRush on a single node machine is 12 times faster than the small cluster! To be fair, the processers in the cluster nodes are not exactly comparable to the ones in the DataRush test box. This test still shows that an inexpensive, single server-class machine with software like DataRush that can take advantage of multicore processors can outperform distributed processing. Cluster in a box in action!

    This is very exciting when put in context of the next generation of multicore processors due out in the "Multicore Month of March". We look forward to running this and other benchmarks on these new systems. Check back soon for updated numbers as I have a sneaking suspicion that we'll easily break through the Terabyte an hour threshold with the MalStone benchmark using the latest Intel and AMD systems.

     

More Posts