Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

February 2010 - Posts

  • PMML validation

    Predictive Model Markup Language (PMML) is the leading standard for statistical and data mining models. A PMML file describes one or more data mining models in an XML document whose root element is of type PMML.

    Our Pervasive DataRush-Analytics project provides the following data mining models: AssociationModel, NaiveBayesModel, and RegressionModel.  The PMML generated from these models can be shared and exchanged from one environment to another, but the PMML needs to be validated against the schema to find any problems that may need to be fixed. 

    To catch as many problems as possible, Pervasive DataRush-Analytics uses both XSD validation and XSLT validation, as recommended by the Data Mining Group.

    First step: XSD Validation:

    Get the PMML XSD 3.2 schema

    Here is an example of validating a PMML file against the PMML XSD schema:

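    import java.io.File;
    import java.io.IOException;
    import javax.xml.XMLConstants;
    import javax.xml.transform.Source;
    import javax.xml.transform.stream.StreamSource;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.validation.Validator;
    import org.xml.sax.SAXException;

    public void pmmlXSDValidate(String schemaPath, String sourcePath) {
        try {
            // Compile the PMML XSD into a reusable Schema object.
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Source schemaFile = new StreamSource(new File(schemaPath));
            Schema schema = factory.newSchema(schemaFile);

            // With no error handler set, validate() throws a SAXException
            // on the first schema violation it finds.
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new File(sourcePath)));
        } catch (SAXException e) {
            // The schema could not be compiled or the PMML document is invalid.
            // ..........
        } catch (IOException e) {
            // The PMML file could not be read.
            // ..........
        }
    }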

    XSD validation is necessary, but it is not sufficient by itself to determine whether a PMML model is valid.
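    When fixing a generated PMML file, it can help to see every schema violation at once rather than only the first one. The following is a minimal sketch of that idea using the standard javax.xml.validation error handler hook. The class name XsdErrorCollector and its collect method are hypothetical, and the sketch takes the schema and document as strings only so that it is self-contained; it is not part of Pervasive DataRush-Analytics.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: gathers all schema violations instead of stopping at the first. */
final class XsdErrorCollector {

    /** Validates xmlText against xsdText and returns one message per recoverable error. */
    static List<String> collect(String xsdText, String xmlText) throws SAXException, IOException {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsdText)));
        Validator validator = schema.newValidator();

        final List<String> problems = new ArrayList<String>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) {
                // Warnings are ignored in this sketch.
            }
            @Override public void error(SAXParseException e) {
                // Record the violation and let validation continue.
                problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
            @Override public void fatalError(SAXParseException e) throws SAXException {
                // The document is not well-formed XML; nothing more can be checked.
                throw e;
            }
        });
        validator.validate(new StreamSource(new StringReader(xmlText)));
        return problems;
    }
}
```

    An empty list means the document passed the schema check. A non-empty list can be logged or shown to whoever maintains the model export.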

    Second step: XSLT Validation:

    Get the PMML XSLT style sheet

    Here is an example of XSLT validation.

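    import java.io.File;
    import java.io.IOException;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.transform.Source;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerConfigurationException;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import org.w3c.dom.Document;
    import org.xml.sax.SAXException;

    public void pmmlXSLTvalidate(String stylesheetPath, String sourcePath, String resultPath) {
        try {
            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            // This setting will ignore the namespace
            docFactory.setNamespaceAware(false);
            DocumentBuilder parser = docFactory.newDocumentBuilder();

            // Parsing from File objects avoids leaving input streams open.
            Document document = parser.parse(new File(sourcePath));
            Source pmmlSource = new DOMSource(document);
            Source xsltSource = new StreamSource(new File(stylesheetPath));

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer transformer = transFactory.newTransformer(xsltSource);
            transformer.transform(pmmlSource, new StreamResult(new File(resultPath)));

            // check result after transformation
            // ......................
        } catch (TransformerConfigurationException e) {
            // The stylesheet could not be compiled.
            // ..............
        } catch (TransformerException e) {
            // The transformation itself failed.
            // ..............
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // The PMML file could not be read or parsed.
            // ..............
        }
    }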
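    How the result of the transformation is checked depends on what the stylesheet writes. As a rough sketch only, assume a rule stylesheet that emits one line of text per violated rule and nothing otherwise; the check then reduces to testing whether the output is empty. The stylesheet below is a toy written for this example, not the DMG stylesheet, and the class name XsltRuleCheck is hypothetical. The toy PMML passed to it is assumed to carry no namespace declaration, which mirrors the namespace-unaware parsing used above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: run a rule stylesheet and treat any text output as violations. */
final class XsltRuleCheck {

    // Toy stylesheet (NOT the DMG one). It reports a DataDictionary whose
    // numberOfFields attribute disagrees with its number of DataField children,
    // a consistency rule that an XSD alone cannot express.
    static final String TOY_RULES =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//DataDictionary[@numberOfFields"
      + " and number(@numberOfFields) != count(DataField)]'>"
      + "<xsl:text>numberOfFields does not match the DataField count&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Returns the stylesheet's trimmed text output; an empty string means no rule fired. */
    static String violations(String stylesheetText, String pmmlText) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheetText)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(pmmlText)), new StreamResult(out));
        return out.toString().trim();
    }
}
```

    Writing to a StringWriter keeps the result in memory, so the caller can decide whether to fail, log the messages, or save them to a file.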

    Problems may still exist even after the PMML passes both validations, but running these tests lowers the probability.  Once validated, Pervasive DataRush-Analytics models will provide results that help you analyze your business data and predict customer needs.

  • DataRush Video Analytics

    Digital video has become the face of television, the internet and mobile devices.  According to an official blog post (May 2009), about 20 hours of video are uploaded to YouTube every minute.  This is equivalent to Hollywood releasing over 114,000 new full-length movies into theaters each week! But digital video also plays a huge role in biomedical devices, surveillance and manufacturing quality assurance.

    Did you know there are approximately eight million users sharing 10 petabytes of data (mostly media files) at any given time? This accounts for nearly 10% of the worldwide internet broadband connections [1].  So how can near real-time actionable intelligence be gleaned from the vast amounts of video data being generated? One answer is to exploit the power of emerging commodity multicore computers. When used properly, each core can run its own thread of computation, but new software applications will need to be developed to make this happen.

    Today there is a parallel programming gap between multicore systems and software applications.  With the end of uniprocessor performance gains, the average software developer will have to write parallel programs to maintain performance growth.  The goal in parallel computing is to perform multiple calculations simultaneously.  The Pervasive DataRush™ (DataRush) platform exploits multiple forms of parallelism, facilitating concurrency in video processing and video analytics from spatial-temporal partitioning down to the pixel level.

    There are several paths to parallelism given the languages and programming frameworks available today, but a very common one is data parallelism.  Data parallelism is a simple divide-and-conquer technique that emerged from SPMD (single program, multiple data): the data is partitioned and distributed over multiple workers (nodes on a cluster, VMs in a cloud), each running the same program.  Hadoop, an open source implementation of MapReduce, logically partitions the data and allocates one map task, called a Mapper, per partition. There may be hundreds of Mappers on a single machine. In this way, a single-threaded legacy application can be deployed against subsets of a large-scale dataset in a cloud or grid environment.  A second path is coarse-grained parallelism via parallelization of loops (TPL: Parallel.For, Parallel.ForEach and RParallel: runParallel) and arrays (ParallelArrays and INVOKE-IN-PARALLEL), with further orchestration onto multiple workers. True fine-grained parallelism requires writing complex and correct multithreaded programs. Fine-grained parallelism here refers to thread-level parallelism (not instruction-level parallelism).
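    The data-parallel pattern described above can be sketched with plain JDK threads. This is only an illustration of the idea (same program, different partition per worker), not the DataRush API or Hadoop; the class DataParallelCount, the method countBright and the pixel threshold are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of SPMD-style data parallelism using a fixed thread pool. */
final class DataParallelCount {

    /** Counts pixels brighter than the threshold by splitting the array across workers. */
    static long countBright(final int[] pixels, final int threshold, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (pixels.length + workers - 1) / workers; // ceiling division
            List<Future<Long>> parts = new ArrayList<Future<Long>>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(pixels.length, w * chunk);
                final int to = Math.min(pixels.length, from + chunk);
                // Every worker runs the same program on its own partition.
                parts.add(pool.submit(new Callable<Long>() {
                    @Override public Long call() {
                        long n = 0;
                        for (int i = from; i < to; i++) {
                            if (pixels[i] > threshold) {
                                n++;
                            }
                        }
                        return n;
                    }
                }));
            }
            // Combine the partial results from all partitions.
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

    The workers share nothing but the read-only input array, which is what makes this style easy to get right. The hard part the post refers to starts when workers must coordinate on shared, mutable state.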

    Figure 1 is a cartoon depiction of a data pipeline for video object detection using principal component analysis (PCA) for background subtraction.  By projecting the original frame onto its eigenspace and subtracting the projected image from the original image, foreground objects become clearly identifiable.  This work is based on Yilmaz et al. (2006). A white paper detailing this work can be found here.
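    The projection-and-subtraction step can be written out in a few lines. The sketch below follows the common "eigenbackground" formulation, which is an assumption here; the white paper's pipeline may differ in detail. It also assumes the mean background frame and an orthonormal set of principal components have already been computed from background frames. The names EigenBackground, residual and foreground are hypothetical.

```java
/** Hypothetical sketch of eigenspace background subtraction on a flattened grayscale frame. */
final class EigenBackground {

    /**
     * Returns the per-pixel absolute difference between the mean-centered frame and its
     * projection onto the eigenspace spanned by the given orthonormal basis vectors.
     */
    static double[] residual(double[] frame, double[] mean, double[][] basis) {
        int n = frame.length;
        double[] centered = new double[n];
        for (int i = 0; i < n; i++) {
            centered[i] = frame[i] - mean[i];
        }

        // Project onto each basis vector and accumulate the reconstruction.
        double[] projected = new double[n];
        for (double[] u : basis) {
            double coeff = 0;
            for (int i = 0; i < n; i++) {
                coeff += u[i] * centered[i];
            }
            for (int i = 0; i < n; i++) {
                projected[i] += coeff * u[i];
            }
        }

        // What the eigenspace cannot explain is treated as candidate foreground.
        double[] diff = new double[n];
        for (int i = 0; i < n; i++) {
            diff[i] = Math.abs(centered[i] - projected[i]);
        }
        return diff;
    }

    /** Marks pixels whose residual exceeds the threshold as foreground. */
    static boolean[] foreground(double[] residual, double threshold) {
        boolean[] mask = new boolean[residual.length];
        for (int i = 0; i < residual.length; i++) {
            mask[i] = residual[i] > threshold;
        }
        return mask;
    }
}
```

    Each output pixel depends only on the frame, the mean and the basis, so the loops over pixels are natural candidates for the pixel-level parallelism mentioned earlier.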

    Video analytics can also be used in medicine for guided surgery and video tele-monitoring of patients.  One use case (see Figure 2 below), and a task in this video processing pipeline, is the identification of regions of interest (ROIs) for physician and clinician decision support.  DataRush parallelism has been applied in experiments with K-Means clustering of digital colposcopic images to identify acetic acid enhanced pre-cancerous lesions (highlighted in red below).  This image analysis can be applied concurrently to individual video frames in order to identify and label ROIs.
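    For readers unfamiliar with K-Means, the following is a minimal single-threaded sketch of the standard Lloyd iteration on scalar pixel intensities. It is not the DataRush operator used in the experiments, and real colposcopic images would more likely be clustered on color features; the class IntensityKMeans and its cluster method are hypothetical, and the initial centroids are supplied by the caller.

```java
/** Hypothetical sketch of K-Means (Lloyd's algorithm) on scalar pixel intensities. */
final class IntensityKMeans {

    /**
     * Assigns each value to its nearest centroid and refines the centroids until the
     * assignment stops changing or maxIterations is reached. The centroids array is
     * updated in place; the returned array holds the cluster index of each value.
     */
    static int[] cluster(double[] values, double[] centroids, int maxIterations) {
        int[] labels = new int[values.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = (iter == 0);

            // Assignment step: pick the nearest centroid for every value.
            for (int i = 0; i < values.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(values[i] - centroids[c]) < Math.abs(values[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }

            // Update step: move each centroid to the mean of its members.
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int i = 0; i < values.length; i++) {
                sum[labels[i]] += values[i];
                count[labels[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return labels;
    }
}
```

    Because each frame can be clustered independently, one call per frame can be handed to separate workers, which is the frame-level concurrency the post describes.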

    The DataRush platform is designed specifically to fully utilize emerging commodity multicore computers. It addresses gaps in design time cost, programming, parallelism, scalability and performance/watt, enabling rapid prototyping of video processing applications. 

    The volumes of video streaming onto televisions, computers and mobile devices are forcing video processing onto cloud and distributed environments. Current cloud and grid computing platforms are still not capable of real-time processing.  Video processing is inherently parallel, and current solutions mostly leverage data parallelism.  Emergent fine-grained parallelism in video processing exploits concurrency in slice-level, frame-level, intra-frame and pixel-level operations. Such fine granularity has traditionally been achieved using video encoding hardware, but this hardware-based approach lacks flexibility. Our approach introduces a video processing development platform that exploits multiple levels of parallelism while facilitating rapid development of agile and adaptive video analytical models.

  • What could you do with a 100x performance improvement?

    Traditionally, "Java performance" has been a bit of a misnomer.  In the early days of Java 1.1, performance was a secondary consideration after ease of programming and usability.  But the last few years have seen some amazing performance enhancements in the JVM: Escape Analysis, Compressed References, and other JDK7 performance enhancements.  I'm not even including the myriad smaller changes that the Sun JVM engineers are working on.

    As a Java developer, you don't have to wait for JVM improvements to get better performance out of your code.  There are best practices on how to use the Java language (Java Concurrency, Java Performance Tuning) and specialized hardware such as Azul Systems that can help.  But what if, after doing all sorts of optimizations, your code STILL isn't running as fast as you want, especially when it comes to using multiple cores, which can be difficult to fully utilize?

    The DataRush team has been working on DataRush 5.0, which is scheduled to be released sometime in the middle of 2010.  Until the new version is released, I thought I would show you some of the performance that we're seeing with the current builds.  All of the applications were run on the following system configuration:

    • 4p/24c AMD Opteron 8435 2.6 GHz
    • 64 GB DDR2
    • Windows 2008 R2 64-bit
    • Java 6 Update 16, 64-bit Server JVM

    The following algorithms were run, with these run times:

    • Naive Bayes
      • Learner - 3.6 seconds
      • Predictor - 7.8 seconds
    • Kmeans - 3.2 seconds

    The DataRush engine fully utilized all twenty-four cores throughout the complete run!  The Naive Bayes algorithm was run on an 8 GB data file, while Kmeans was run on a 3.2 GB data file.  The DataRush team is working on more algorithms, and these are the first set of impressive results.

    What does this mean for your business?  Imagine being able to run complex calculations, analytics and other algorithms in seconds!  Instead of waiting to run calculations overnight, you could run computations as needed and as often as needed.  You could update models in near real time, take different inputs and view results as they happen, making better business decisions based on your whole data set instead of a sample.  If this sounds interesting, head over to Pervasive DataRush to get your two-week trial of DataRush.
