Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

Text Mining to extract content from Netflix Prize Movie Titles

We recently published a recommender system built on collaborative filtering principles. While collaborative filtering proved effective in predicting users' movie ratings from historical community ratings, we would like to consider a content-based filtering approach to enhance the accuracy of the recommendations. This post demonstrates a standard experiment in information extraction.

Text mining of Netflix Prize movie titles: the Netflix dataset contains a total of 17770 movie titles. To automatically extract possible hidden facts and discover implicit links between movie titles and customer movie preferences, common text mining techniques were applied to extract and create word vector features. The word vectors are ranked using TF-IDF weight as the metric for word importance. Words that appear 17000 times or more were considered too frequent, and words that appear twice or less were considered too infrequent; both groups were pruned.
The following steps describe the text mining techniques:
1. A string tokenizer splits the whole text into individual units or tokens. Tokenization uses the Unicode specification to decide whether a character is a letter. All non-letter characters are assumed to be separators, so the resulting tokens contain only letters.
2. All characters in a word are converted to lower case.
3. A filter is applied to remove stopwords. A stopword list is a list of words that are either insignificant (e.g., articles and prepositions) or so common that the results would overwhelm the analysis. I chose the WordNet stopword list as the list to ignore, a common practice in the field of information retrieval.
4. A token length filter is applied to remove words that are too short (3 characters or fewer).
5. Finally, a Porter Stemmer is applied to map different grammatical forms of a word to a common term.
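
The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used for the experiment: the stopword set is a tiny placeholder rather than the WordNet list, `naive_stem` is a crude suffix stripper standing in for the Porter Stemmer, and the names `preprocess`, `naive_stem`, `STOPWORDS` and `LETTER_RUN` are hypothetical.

```python
import re

# Placeholder stopword set for illustration; the post used the WordNet list.
STOPWORDS = {"the", "and", "with", "from", "that", "this", "into", "over"}

# A run of Unicode letters: word characters minus digits and underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def naive_stem(word):
    """Crude suffix stripper. This is NOT the Porter algorithm, only a stand-in."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(title, min_length=4):
    """Turn one movie title into a list of normalized tokens."""
    tokens = LETTER_RUN.findall(title)                    # step 1: letters only
    tokens = [t.lower() for t in tokens]                  # step 2: lower case
    tokens = [t for t in tokens if t not in STOPWORDS]    # step 3: stopwords
    tokens = [t for t in tokens if len(t) >= min_length]  # step 4: drop <= 3 chars
    return [naive_stem(t) for t in tokens]                # step 5: stemming


if __name__ == "__main__":
    print(preprocess("The Seasons of Love: 2nd Edition"))
```

Because every non-letter character is a separator, a token such as "2nd" is reduced to "nd" in step 1 and then dropped by the length filter in step 4. In a real run, a tested Porter implementation would replace `naive_stem`.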

Results: The text mining techniques described above identified 4542 regular attributes (individual words). The bar chart below shows the frequency distribution of words that appear 100 times or more in the Netflix movie titles. The most commonly used word in movie titles is “season”. The word “man” appears 1.5 times as often as the word “girl”. “Blue” is the color name used most often in movie titles.

Figure 1 - Frequency distribution of words that appear 100 times or more in the Netflix movie titles

Zipfian distribution
The ranking of each of the 4542 words was calculated from its term frequency-inverse document frequency (TF-IDF) weight. Figure 2 shows the inverse relationship between the frequency of words in the Netflix movie titles and their rank.
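
As a rough sketch of how TF-IDF weights and the pruning thresholds described earlier could be computed, consider the Python below. The post does not say which TF-IDF variant was used, so this assumes raw term counts multiplied by ln(N / df), and it assumes the pruning thresholds are document counts (a word is kept if it occurs in at least 3 and fewer than 17000 titles). The function names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tfidf_vectors(docs, min_df=3, max_df=16999):
    """Return one {term: weight} dict per document.

    Assumed variant: weight = raw term count * ln(N / df).
    Terms with df outside [min_df, max_df] are pruned.
    """
    n_docs = len(docs)
    df = document_frequencies(docs)
    vocab = {term for term, count in df.items() if min_df <= count <= max_df}
    vectors = []
    for tokens in docs:
        tf = Counter(t for t in tokens if t in vocab)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors
```

Here `docs` is a list of token lists, one per movie title. With this weighting, a word that occurs in nearly every title gets a weight close to zero, which is the same intuition behind pruning words that are too frequent.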

Figure 2 - Zipf's Law


Zipf’s law holds despite the small corpus. For higher-frequency words (>450), the red line corresponding to Netflix word frequency closely follows the blue line (1/x). The most frequent word, “season”, occurs approximately 2.5 times as often as the second most frequent word, “live”. Each of the 4542 words becomes an attribute describing per-movie content. The year the movie was released is also available as part of the content. This movie content can then be used to build a content-based recommender system to predict ratings.
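
The comparison against the 1/x curve can be reproduced with a short rank-frequency table. The sketch below is a hypothetical illustration, not the code behind Figure 2: it assumes the reference curve is the top word's frequency divided by rank, and the function names are invented for this example.

```python
from collections import Counter


def rank_frequency(docs):
    """Return (rank, term, frequency) tuples, most frequent term first."""
    counts = Counter(token for tokens in docs for token in tokens)
    ranked = counts.most_common()
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ranked, start=1)]


def zipf_expected(top_frequency, rank):
    """Frequency predicted by an idealized Zipf curve: f(rank) = f(1) / rank."""
    return top_frequency / rank


def compare_to_zipf(docs):
    """Pair each observed frequency with the idealized 1/rank prediction."""
    table = rank_frequency(docs)
    top = table[0][2]
    return [(rank, term, freq, zipf_expected(top, rank)) for rank, term, freq in table]
```

Plotting the third and fourth columns of `compare_to_zipf` against rank would give two curves analogous to the red and blue lines in Figure 2.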


 


About n5712036

Dr. Nena M. Marín joined Pervasive Software Innovations Laboratories (iLabs) in September of 2008. Her research efforts in iLabs focus on Parallel Data Mining algorithms and their applications in business and science. Dr. Marín’s research interests include data intensive high performance computing, mathematical modeling and simulations of physiological systems, spectral pattern recognition for disease detection and drug delivery, bioinformatics and Monte Carlo simulations in tissue photonics. Her most recent industry research interests include patterns in large scale and sparse datasets, clustering and unsupervised learning, collaborative filtering recommender systems and Marketing and Sales Optimization Churn analysis. Dr. Marín’s most recent work, entitled “Pervasive Parallelism in Data Mining: Dataflow Solution to Co-clustering Large and Sparse Netflix Data”, has been selected for presentation at the Knowledge Discovery and Data Mining (KDD) Conference, July 2009, in Paris, France. She leads collaborations with Academic Partners focusing on bringing the power of commodity multi-core and parallel architecture into the hands of researchers to accelerate delivery of science. Dr. Marín is a National Science Foundation Fellow. After attaining both a Bachelor of Science Degree in Mechanical Engineering in 1984 and a Master's Degree in Mechanical Engineering in 1995 at the University of Texas at Austin, Dr. Marín received her Ph.D. in Biomedical Engineering from the University of Texas at Austin in 2005. Her Ph.D. research was funded by the National Institutes of Health and focused on pattern recognition and automated data mining algorithms for cervical cancer detection. Dr. Marín worked as part of a multidisciplinary team in a Phase II Clinical Trial conducted at M.D. Anderson Cancer Center and the British Columbia Cancer Center in Vancouver, Canada.