We recently published a recommender system built on collaborative filtering principles. While collaborative filtering proved effective in predicting ratings of movies by users based on historical community movie ratings, we would like to consider a content-based filtering approach to enhance the accuracy of the recommendations. This blog demonstrates a standard experiment in information extraction.
Text mining of Netflix Prize movie titles: the Netflix dataset contains 17770 movie titles. To uncover hidden facts and implicit links between movie titles and customer movie preferences, we applied common text mining techniques to extract word vector features. The word vectors are weighted using TF-IDF as the metric for word importance. Words that appear 17000 times or more were considered too frequent, and words that appear twice or less were considered too infrequent; both groups were pruned.
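To make the pruning and weighting concrete, here is a minimal sketch in plain Python. It is not the tool used for this post. It assumes each title has already been reduced to a list of tokens, that the pruning thresholds count the number of titles a word appears in (document frequency), and that TF-IDF is computed in one common variant (relative term frequency times log of inverse document frequency). The function name `tfidf_vectors` and its parameters are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=3, max_df=16999):
    """Build pruned TF-IDF vectors from a list of token lists.

    Assumption: pruning is by document frequency, so the defaults drop
    words found in 2 or fewer titles and in 17000 or more titles.
    Returns (vocabulary, one {term: weight} dict per document).
    """
    n_docs = len(docs)
    # Document frequency: in how many titles does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Keep only terms whose document frequency lies inside the bounds.
    vocab = sorted(t for t, c in df.items() if min_df <= c <= max_df)
    keep = set(vocab)

    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if t in keep)
        total = sum(counts.values())
        # Relative term frequency times log inverse document frequency.
        vectors.append({
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return vocab, vectors

# Toy usage with lowered thresholds so the tiny corpus is not pruned away.
toy_titles = [["star", "war"], ["star", "trek"],
              ["star", "war", "clone"], ["lone", "star"]]
vocab, vectors = tfidf_vectors(toy_titles, min_df=2, max_df=3)
```

In the toy corpus "star" occurs in every title and is pruned as too frequent, while the words that occur only once are pruned as too infrequent, which mirrors the two-sided pruning described above.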
The following steps describe the text mining techniques:
1. A string tokenizer splits whole text into individual units or tokens. Tokenization uses Unicode specification to decide whether a character is a letter. All non-letter characters are assumed to be separators, thus the resulting tokens contain only letters.
2. All characters in a word are converted to lower case.
3. A filter is applied to remove stopwords. A stopword list is a list of words that are either insignificant (e.g., articles and prepositions) or so common that the results would overwhelm the analysis. We chose the WordNet stopword list as the list to ignore, a common practice in the field of information retrieval.
4. A token length filter is applied to remove words that are too short (3 characters or fewer).
5. Finally, a Porter stemmer is applied to map different grammatical forms of a word to a common term.
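The five steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used for this post: `STOPWORDS` is a tiny made-up list standing in for the full stopword list, and `naive_stem` is a deliberately crude placeholder that is not the Porter algorithm. If NLTK is installed, `nltk.stem.PorterStemmer().stem` could be passed as the `stem` argument instead.

```python
# Tiny illustrative stopword list (assumption); a real list is much longer.
STOPWORDS = {"the", "and", "with", "from", "of", "a", "in"}

def tokenize(text):
    """Step 1: keep runs of Unicode letters; everything else separates."""
    return "".join(ch if ch.isalpha() else " " for ch in text).split()

def naive_stem(word):
    """Placeholder stemmer, NOT Porter: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(title, stem=naive_stem, min_len=4):
    tokens = tokenize(title)                            # 1. tokenize
    tokens = [t.lower() for t in tokens]                # 2. lower case
    tokens = [t for t in tokens if t not in STOPWORDS]  # 3. drop stopwords
    tokens = [t for t in tokens if len(t) >= min_len]   # 4. drop short words
    return [stem(t) for t in tokens]                    # 5. stem

terms = preprocess("The Lord of the Rings: The Two Towers (2002)")
```

Here the year "2002" and the punctuation vanish in step 1 because they are not letters, "two" is dropped by the length filter, and the remaining words are stemmed, leaving `['lord', 'ring', 'tower']`.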
Results: 4542 regular attributes (individual words) were identified using the described text mining techniques. The bar chart below shows the frequency distribution of words that appear 100 times or more in the Netflix movie titles. The most commonly used word in movie titles is “season”. The word “man” appears 1.5 times more than the word “girl”. “Blue” is the most frequently used color name in movie titles.
Zipfian distribution
Each of the 4542 words was ranked by its term frequency-inverse document frequency (TF-IDF) weight. Figure 2 shows the relationship between word frequency in the Netflix movie titles and word rank.
Zipf’s law is demonstrated despite the small corpus. For the higher-frequency words (>450), the red line corresponding to Netflix word frequency closely follows the blue line (1/x). The most frequent word, “season”, occurs approximately 2.5 times as often as the second most frequent word, “live”. Each of the 4542 words becomes an attribute describing the content of a movie. The year the movie was published is also available as part of the content. This movie content can then be used to build a content-based recommender system to predict ratings.
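For readers who want to reproduce the rank-frequency comparison, the sketch below ranks words by raw count and places each observed count next to the value an idealized 1/rank curve would give. It assumes the reference curve is the top count divided by the rank, which may differ from how the blue line in the figure was scaled, and the function name `zipf_table` is hypothetical.

```python
from collections import Counter

def zipf_table(tokens):
    """Return rows of (rank, term, observed count, top count / rank)."""
    ranked = Counter(tokens).most_common()  # highest count first
    if not ranked:
        return []
    top = ranked[0][1]
    return [
        (rank, term, count, top / rank)
        for rank, (term, count) in enumerate(ranked, start=1)
    ]

# Toy usage: counts of 6, 3 and 2 match the ideal 1/rank curve exactly.
rows = zipf_table(["a"] * 6 + ["b"] * 3 + ["c"] * 2)
```

Under an exact 1/rank law the second-ranked word would occur half as often as the first, so comparing the last two columns row by row shows where a real corpus departs from the ideal curve.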