Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

Fraud Detection and “Finding a needle in a haystack”

 

Fraud Detection and “Finding a needle in a haystack”

Benford’s law has been promoted as providing auditors with an automated tool that is simple and
effective for fraud detection. 

The law of anomalous numbers was published in 1938 by Frank Benford [1].  Benford noticed that if
a number is chosen at random from a large data set, the probability of the first digit of such number
is a 1 is about 30.1%, the probability that the first digit is a 2 is about 17.6%, and in general the
probability that the first digit  is d is log10(1 + 1/d).  This distribution was named after Frank Benford
as Benford’s Law, thou Canadian-American astronomer
Simon Newcomb [2] discovered this statistical
principle back in 1881.
The principle behind this curiosity in numbers is the distance between numbers in a logarithmic scale.


Figure 1 – Logarithmic Scaled Paper.

log10(2) - log10(1)  = distance between log of 1 and log of 2 is 0.3010. 

log10(3) – log10(2) = 0.1761

log10(4) – log10(3) = 0.1250  and so on…

The expected distribution represents the distance between digits in log scale as depicted in the
logarithmic scale, see Figure 2.

 
Figure 2 – Benford’s Law of First Digits Distributions and corresponding Logarithmic Scale.

Benford’s Law is the DNA-equivalent technique for financial analysis of financial datasets. Any data
that has been artificially generated will not follow this 1st digit distribution. As an example in Figure 3,
I have taken the first two columns in the census 2000 SF3 dataset [4].  I then used a random number
generator to add a third column which I called “incomeTotal”.  The black line below shows the
expected Benford’s Distribution of first digits.  The blue and green lines correspond to “housingTotal”
and “ownTotal” columns, first digit distributions.  Both of these columns follow Benford’s Law
represented by the black line.


Figure 3 – Benfords Law of First Digits on Census 200 Dataset Example.

Conversely, Benford’s Law analysis clearly indicates that a significant portion of “incomeTotal”
contains artificial numbers (See red line, Figure 3) . The desired result from such analysis is a list of
attributes or columns and the particular row(s) where the first digit counts deviate from the Benfords
distribution.  In this figure, column labeled “incomeTotal” and all its rows whose first digit begins with
digits 3 or 4 would be listed as possible fraudulent transactions. One disadvantage of this approach is
that a sufficient sample size of data is required before a meaningful distribution can be determined. 
As a first line of defense for automated artificial numbers filter, Benford’s law of first, second, third
and fourth digits is a widely accepted practice by auditors. 

A dataflow implementation of Benfords’ Law first partitions a large scale dataset by rows (horizontal
partitioning).  Subsets of rows are read, parsed and extracted in parallel using the DataRush
PartitionedDelimitedTextReader operator.  Each data row becomes a data token. Each data token
contains all fields, column values; for a single row. Multiple data tokens or rows are stored in each
partition as a DataRush RecordFlow. Data tokens flow downstream in the application graph (See
Figure 4).  As each data token arrives to the DigitCount operator, each column’s first digit counts are
binned (1 thru 9 bins). Binning and counting happens in parallel for each partition.  Figure 4 shows the
dataflow application graph for the DataRush implementation for scalable Benfords’ Law. Each dataflow
pipeline computes the first digit counts concurrently.  Then the dataflows are reduced to the final
probability distributions of first digits by aggregating the counts from each partition and normalizing
by the total counts.


Figure 4 – DataRush Benford’s Law Application Graph.

Figure 5 shows scalability of the parallel implementation across 1 thru 8 cores.


Figure 5 – Scalable Benford’s Law operator across 1 through 8 cores.

The red line in Figure 5, represents expected runtimes based on linear scalability across increasing
number of cores.  The blue line shows the preliminary results of the Benford’s Law DataRush Fraud
operator on the census 2000 dataset. Typically application runtimes using a single core are expected
to be reduced by one-half when increasing the number of cores from 1 to two.  Likewise, single core
application runtimes are expected to be reduced by one-fourth when increasing the number of cores
from 1 to 4.  DataRush applications automatically self tune at runtime based on available resources
to scale with number of cores.


Figure 6 – Benford’s Law Application Java Code.

DataRush framework facilitates the implementation of self-tuning, scalable parallel applications like
this Fraud Detection example.  The Fraud Detection application graph took approximately 6 hrs to
compose.  This typical implementation involved the use of existing DataRush I/O operators as wells as
the implementation of new “BenfordBinning” and “ReduceResults” Operators.  Figure 6 shows the
BenfordsLaw Class implementation.  The “VectorReader” operator calls DataRush PartitionedTextDelimitedReader operator to read and create “partitionCount” partitions of a dataset
based on current available cores.  The “For” loop within the “BIN” section of the code adds one
BenfordBinning operator per partition and names the worker thread “DigitCount”.  For example: at
any time on a 24-core computer, 24 “DigitCount” threads may be running concurrently.  Each thread
is running one instance of the “BenfordBinning” operator for a total of 24 concurrent binning calculations. 
In the “AGGREGATE” section of the code “RoundRobinUnpartitioner”; a part of DataRush core operators
library, is called to un-partition the dataflows in preparation for the final step “ReduceResults”. In this
final step, the counts for each digit and each column from the multiple partitions are summarized into
a global count for each of the nine bins (1-9).

The Benford’s Law of anomalous numbers has been broadly used in budget, income tax, financial and
forensic accounting analysis. This blog uses this well known and broadly used algorithm for Fraud
Detection as an example to demonstrate the benefits of the DataRush platform.  This application (see
Figure 6) is comprised of 71 new lines of code that fully utilize multicore computers. The DataRush
platform addresses gaps in design time cost, programming, parallelism, scalability and
performance/watt, enabling rapid prototyping of data and computational intensive applications.

 

References

[1]          Benford, F. "The Law of Anomalous Numbers." Proc. Amer. Phil. Soc. 78, 551-572, 1938.

[2]          Newcomb, S., http://en.wikipedia.org/wiki/Simon_Newcomb#Benford.27s_law

[3]          Nigrini, M.J., “Taxpayer compliance application of Benford’s law.” Journal of the American
Taxation Association, 18(1):72-92, 1996

[4]          Dataset: US Census 2000 Summary File 3 (SF3)
REF: http://www.census.gov/Press-Release/www/2002/sumfile3.html

 

 

Comments

No Comments

About n5712036

Dr.Nena M. Marín joined Pervasive Software Innovations Laboratories (iLabs) in September of 2008. Her research efforts in iLabs focus on Parallel Data Mining algorithms and their applications in business and science. Dr. Marín’s research interests include data intensive high performance computing, mathematical modeling and simulations of physiological systems, spectral pattern recognition for disease detection and drug delivery, bioformatics and Monte Carlo simulations in tissue photonics. Her most recent industry research interests include patterns in large scale and sparse datasets, clustering and unsupervised learning, collaborative filtering recommender systems and Marketing and Sales Optimization Churn analysis. Dr. Marín’s most recent work entitled “Pervasive Parallelism in Data Mining: Dataflow Solution to Co-clustering Large and Sparse Netflix Data” has been selected for presentation at the Knowledge Discovery and Data Mining (KDD) Conference July 2009 in Paris, Fr. She leads collaborations with Academic Partners focusing on bringing the power of commodity multi-core and parallel architecture into the hands of researchers to accelerate delivery of science. Dr. Marín is a National Science Foundation Fellow. After attaining both a Bachelor of Science Degree in Mechanical Engineering in 1984 and a Masters Degree in Mechanical Engineering in 1995, at the University of Texas at Austin, Dr. Marin was bestowed her Ph.D. in Biomedical Engineering at the University of Texas at Austin in 2005. Her Ph.D. research was funded by the National Institute of Health Program and focused on pattern recognition and automated data mining algorithms for cervical cancer detection. Dr, Marín worked as part of a multidisciplinary team in a Phase II Clinical Trial conducted at M.D. Anderson Cancer Center and the British Columbia Cancer Center in Vancouver, Canada.