Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

September 2009 - Posts

  • The Building Block of Predictive Analytics

    We are evolving our Pervasive DataRush (PDR) platform to deliver a building block of predictive analytics: a collaborative filter.  Business gurus know that understanding customer behavior can only benefit a company, and proper predictive analytics will only add value to the bottom line.  Every 5% increase in retention equals a 25% to 100% increase in profitability.1 

    How can PDR’s collaborative filter application increase my sales? By using a leading edge co-clustering algorithm2, the PDR collaborative filter finds neighborhoods of like-minded customers. These clusters, co-occurrence of like-minded customers versus products, can be used to predict what other products a customer may have the highest likelihood to purchase. This innovative use of customer latent behavior provides powerful insight and enables cross-selling and up-selling to customers as they browse company web sites or talk to a customer representatives.

    Built upon the solid foundation of the DataRush engine, the collaborative filter application runs extremely fast. For example, running the Netflix challenge data through 20 iterations took only 17 minutes on a commodity 16-core server. That’s 17 minutes to build a complex model of 480,189 users and recommend 17,770  movies!

    Fast is great, but accuracy has to come first. With DataRush, you can achieve both. Again, referring to the Netflix data, our collaborative filter was able to achieve an approximate 8.18% (on training set) increase in accuracy over the baseline system (Cinematch; Netflix production system). Imagine what it can do for your data.  It’s like having the help of a psychic for each customer! 

    We will be announcing our collaborative filter very soon and hope to start working on a content-based filtering algorithm. 


    1The Ultimate Question by Frederick Reicheld of Bain & Co.  
    2Maximum Entropy approach to Bregman Co-clustering by researchers at UT Austin IDEAL lab.

     

  • Text Mining to extract content from Netflix Prize Movie Titles

    We recently published a recommender system built on collaborative filtering principles. While collaborative filtering proved effective in predicting ratings of movies by users based on historical community movie ratings, we would like to consider a content-based filtering approach to enhance the accuracy of the recommendations.  This blog demonstrates a standard experiment in information extraction.

    Text mining of Netflix Prize movie titles: there were a total of 17770 movie titles in the Netflix Dataset. With the aim of automatically extracting possible hidden facts and discovering implicit links between actual movie titles and customer movie preferences, common text mining techniques were applied to extract and create word vector features. The word vectors are ranked using TF-IDF weight as the metric for word importance.  Words that appear 17000 times or more were considered too frequent and therefore were pruned off.  Words that appear twice or less were considered too infrequent and therefore were pruned off.
    The following steps describe the text mining techniques:
    1. A string tokenizer splits whole text into individual units or tokens. Tokenization uses Unicode specification to decide whether a character is a letter. All non-letter characters are assumed to be separators, thus the resulting tokens contain only letters.
    2. All characters in a word are converted to lower case.
    3. A filter is applied to remove stopwords. A stopword list is a list of words that are either insignificant (i.e., articles and prepositions) or so common that the results would overwhelm the analysis. I chose the WordNet stopword list as the list to ignore, a common practice in the field of information retrieval.
    4. A token length filter is applied to remove words that are too short (3 characters or less)
    5. Finally a Porter Stemmer is applied to map different grammatical forms of a word to a common term.

    Results:  4542 regular attributes or individual words were identified using the described text mining techniques.  The bar chart below shows the frequency distribution of words that appear 100 times or more in the Netflix movie titles.  The most commonly used word in movie titles is “season”. The word “man” appears 1.5 times more than the word “girl”.  The color “Blue” is the most used color name in a movie title.

    Figure 1 -  

     

     

     

     

     

     

     

      

    Zipfian distribution
    The ranking of each of the 4542 words was calculated based on term frequency to inverse document frequency. Figure 2 shows the proportional relationship between Netflix Movie Titles word frequency and their ranking.

    Figure 2- Zipf's Law 


    Zipf’s law is demonstrated despite the small corpus.  The red line corresponding to Netflix word frequency follows for higher frequency words (>450) closely the blue line (1/x). The most frequent word “season” occurs approximately 2.5 times the second most frequent word “live”. Each of the 4542 words become an attribute corresponding to per movie content.  The year the movie was published is also available as part of the content.  This movie content can then be used to build a content-based recommender system to predict ratings.


     

  • Applications of java.lang.reflect.Proxy

    The Proxy class is one of the cooler features of the Java SDK. In short, it allows you to create a dynamic implementation of a statically defined interface.  This is useful in cases where you wish to have type safety at the interface level, while still having a general-purpose implemention.

     

    To use Proxy, you must give the Proxy class an interface to implement and an “InvocationHandler”. The InvocationHandler is a callback that you implement that provides the behavior for the class. The Proxy class dynamically generates an implementation of that interface that dispatches all method invocations to the “invoke” method of your InvocationHandler.

     

    This has a number of applications such as RMI, Object-Relational mapping, AOP, etc. It’s also possible to create a full implementation of an interface where the behavior is determined by convention alone.

     

    For example, one can create a BeanProxy that automatically implements a bean ( assuming only setters and getters ). As a client of BeanProxy, I can simply define at interface:

        
    
        public interface ExampleInterface {
            //…
            public String getStringValue();
            public void setStringValue(String value);
        }
    
    
    I can now create an instance of that interface and use it as follows:
            
    
            ExampleInterface inter = BeanProxy.createBean(ExampleInterface.class);
            inter.setStringValue("mystr");
            System.out.println(inter.getStringValue());  //prints "mystr"
    
    
    No implementation of the interface required! 

    Here’s how it works. The following is the BeanInvocationHandler ( see the attached BeanProxy for the full source code ):

        private static class BeanInvocationHandler implements InvocationHandler {  
    
            //…
          
            /**
             * Backing store for property values 
             */
            private final Map propertyValues = new HashMap();
            
            //…        
            
            @Override
            public Object invoke(Object proxy, Method method, Object[] args)
                    throws Throwable {
                if ( isGetter( method ) ) {
                    //the method is a getter--get the value of the property
                    Object rv = propertyValues.get(getPropertyName(method));
                    //…(special processing for primitive types)
                    return rv;
                }
                else if ( isSetter( method ) ) {
                    //the method is a setter--store the property in the map
                    propertyValues.put(getPropertyName(method), args[0]);
                    return null;
                }
                else {
                    throw new UnsupportedOperationException("Unable to handle method: "+method);
                }
            }
            //…
            
        }
    
    

    When I make a call to inter.setStringValue(“mystr”), the following occur:

    1)BeanInvocationHandler.invoke is called with a method parameter equal to the “setStringValue” method object and an args Object [] equal to

    new Object[]{“mystr”}

    2)The method is a setter ( isSetter returns true ), so we add an entry “StringValue=mystr” to the propertyValues HashMap.

     

    When I then make a subsequent call to inter.getStringValue(), the following occur:

    1)BeanInvocationHandler.invoke is called with a method parameter equal to the “getStringValue” method object.

    2)The method is a getter ( isGetter returns true ), so we fetch “StringValue” from the propertyValues HashMap, which is equal to “mystr”.

    3)The “invoke” method returns “mystr”.

    4)“mystr” is returned from the call inter.getStringValue() and we display “mystr”.

     

    One caveat: beware of performance implications of proxy. Every method invocation requires at a minimum the creation of the args Object[]. In addition, primitive typed arguments must be boxed for each method invocation!

     

  • Density-Based Clustering

    Data Mining and the knowledge discovery process seeks to find patterns in data.  A common approach to pattern recognition is clustering.  In recently published work (KDD2009), a dataflow implementation of kmeans and later two dimensional kmeans (co-clustering) identified like-minded customer rental patterns of Netflix movies.  But what happens when only part of the data clusters well?  This is a first of a series of articles describing density-based clustering.   The concept of density-based clustering emerged from the need to identify fewer and tighter clusters in data.  In 1914 Astronomer H.N. Russell introduced the HR diagram, which for the first time classified stars into two natural groups based on luminosity and temperature: Giants and Dwarfs.  This was the basis for D. Wishart “Hierarchical Mode Analysis”, the first algorithm capable of clustering a compact hierarchy of dense clusters.   I am fascinated by Astronomy, and by fate my next project in collaboration with the Texas Advanced Computing Center (TACC) and the University of Texas at Austin (UT-IDEAL) is on identifying regions of interest in very large cosmological datasets using TACC’s Teragrid and our DataRush parallelism engine (using dataflow computational model).  The very large cosmological dataset to be analyzed comes from an ongoing HPC TACC project led by Tom Quinn, University of Washington Astronomy professor, N-Body shop director and Principal Investigator.  See the HPCwire article on Quinn’s project with TACC: “Matching Simulation to Reality in the Life of the Universe” by Aaron Dubrow.  

    So let me begin this blogging series by providing an abstract of the “HALO project”
    How do planets form? This most basic question of astrophysics continues to stump astronomers.  Cosmological discoveries emerge for scientist through the visualization stage of the computing process, when data from a predictive modeling simulation is reinterpreted as dynamic, information-rich images.   One of the main problems hampering progress is the enormous range of scale and the amount of information: 33 million particles at a million time-steps over the life of the universe.  Typically the first phase of discovery requires identifying possible star formations in the form of halos based on spatial and density variations over different resolutions/scales.  We present an automated clustering and visualization framework that combines Teragrid resources and a scalable density based clustering algorithm: “Hierarchical Density Shaving” for large cosmological datasets.

     

More Posts