The Fifteenth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD'09) was held last week in Paris, France.
The annual ACM SIGKDD conference is the premier international
forum for data mining researchers and practitioners from academia,
industry, and government to share their ideas, research results and
experiences. For several reasons, this year's KDD was special. First,
it received a record 659 submissions, more than 10% up from last year.
Second, it marked the first time the event was held in
Europe - beautiful Paris! Third, the KDD Cup 2009 drew a record 4,741
complete valid entries out of 7,877 total entries.
Pervasive Software had a team of five people in attendance and
sponsored exhibitor booth #4 in the "Foyer Rives de Seine".
In the booth, we ran demonstrations of our KDD'09 featured project,
"Pervasive's Dataflow Solution to the Netflix Recommender System,"
on an HP quad-core laptop (Intel Q9300 @ 2.53 GHz).
On Monday, June 29th, the plenary invited talk by David Hand critiqued
the widespread use of AUC, exposing its fundamental incoherence
and closing with a family of coherent alternative scores. The evening gala
and poster session was hosted by the Mayor of Paris, Bertrand Delanoë,
at the Hôtel de Ville, the city hall of Paris. The poster on a
MapReduce cluster implementation of the Bayesian Browsing Model for
petabyte-scale click-log data was most interesting to me. Using exact
inference in a single-pass implementation, it reportedly processes
click logs covering 1.15 billion queries in 3 hours.
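To put that reported figure in perspective, a quick back-of-the-envelope calculation gives the sustained throughput it implies:

```python
# Throughput implied by the reported BBM numbers (approximate).
queries = 1.15e9                       # 1.15 billion queries (reported)
seconds = 3 * 3600                     # 3 hours of wall-clock time (reported)
per_second = queries / seconds
print(f"{per_second:,.0f} queries/second")  # -> 106,481 queries/second
```

Roughly 100,000 queries processed per second, which explains why a single-pass, exact-inference design was the headline of the poster.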
Tuesday, June 30th, was a big day for Pervasive Software at KDD 2009!
Pervasive Software was recognized as a technology leader on two fronts:
it was selected for the panel on open standards in data mining (PMML) and
for an industrial research presentation on parallel data mining.
Panel: Open Standards and Cloud Computing
Pervasive's own CTO Mike Hoskins joined a distinguished panel of thought leaders
in the data mining industry, including representatives from DMG / Open Data Group,
IBM, KNIME, KXEN, MicroStrategy, SAS, SPSS, and Pervasive Software.
Talk: Pervasive Parallelism in Data Mining
This industry session presentation unveiled the runtime performance
of Pervasive's Dataflow Solution to the Netflix Recommender System. The bottom
line of this research is that the Pervasive DataRush parallelism engine
produced movie recommendations of comparable accuracy 9-44 times
faster than top Netflix Prize solutions. Our solution predicted 100 million
ratings in 16.31 minutes and achieved an RMSE of 0.88846.
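For readers unfamiliar with the metric: RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower is better. A minimal sketch on toy data (hypothetical ratings, not the Netflix set):

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error between equal-length lists of ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy data: four predicted ratings vs. the true ratings.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4.0, 4.0, 3.0, 5.0]
print(round(rmse(predicted, actual), 4))  # -> 0.559
```

On the Netflix Prize, this score was computed over the full 100-million-rating qualifying set, which is why the runtime to produce all predictions matters as much as the score itself.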
Most of the feedback I received on the presentation came during
the evening poster sessions. From the academic perspective, parallel
data mining is a hot topic: the scientific community will no longer settle
for sampling or weeks of processing time. At the SciDAC 2009
Conference in San Diego three weeks ago, scientists were solving
exascale data and computational problems with parallelism on
149,504 processing cores (the Jaguar XT5). From the industrial
perspective, growing data volumes and the high cost of power and facilities
have given rise to infrastructure as a service, application as a service,
and platform as a service. Commodity multi-core, cloud, and cluster
computing are the hardware options, but the software requires
re-architecting for everything from coarse-grain to fine-grain parallelism. The
impressive runtimes drew attendees to our talk. The dataflow
computational model solved data mining problems otherwise
constrained by heap size, and the DataRush framework enabled
fine-grain parallelism for rapid algorithm development, even for
developers without parallel programming experience.
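To illustrate the dataflow idea in miniature: each operator runs concurrently and passes records downstream through bounded queues, so memory use is limited by queue depth rather than dataset size. The sketch below is an illustrative Python analogy, not the DataRush API (DataRush is a Java framework with its own operator library):

```python
# Illustrative dataflow pipeline: concurrent stages joined by bounded queues.
import threading
import queue

def stage(fn, inq, outq):
    """Run one operator: pull records, apply fn, push results downstream."""
    while True:
        item = inq.get()
        if item is None:            # sentinel: forward shutdown and exit
            outq.put(None)
            break
        outq.put(fn(item))

def run_pipeline(data, fns):
    """Chain each fn as a concurrent stage; queues bound memory use."""
    queues = [queue.Queue(maxsize=4) for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for item in data:               # feed records into the first queue
        queues[0].put(item)
    queues[0].put(None)             # signal end of input
    results = []
    while True:                     # drain the final queue
        out = queues[-1].get()
        if out is None:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline([1, 2, 3, 4], [lambda x: x * x, lambda x: x + 1]))
# -> [2, 5, 10, 17]
```

The key property is that no stage ever holds the whole dataset: records stream through, which is how a dataflow engine sidesteps the heap-size constraints mentioned above.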
Overall at KDD'09, parallel data mining and scalable algorithms
received special attention, with a total of 13 talks (9.3% of all talks)
dedicated to the topic. The common thread was the need to
mine gigabytes and terabytes of data efficiently and in a timely fashion.
It is time to have Industry and Research Track Sessions
dedicated to High Performance and Parallel Data Mining!
REFS: 13 KDD'09 papers on scalable and parallel data mining
1. Pervasive Parallelism in Data Mining: Dataflow solution to Co-clustering Large and Sparse Netflix Data (Srivatsava Daruru, Nena M Marin, Matt Walker, Joydeep Ghosh)
2. Demo D07 - SHIFTR: A Fast and Scalable System for Ad Hoc Sensemaking of Large Graphs (Duen Horng Chau, Aniket Kittur, Hanghang Tong, Christos Faloutsos, Jason I. Hong)
3. Parallel Community Detection on Large Networks with Propinquity Dynamics (Yuzhou Zhang, Jianyong Wang, Yi Wang, Lizhu Zhou)
4. Scalable Graph Clustering Using Stochastic Flows: Applications to Community Discovery (Venu Satuluri, Srinivasan Parthasarathy)
5. W02: SkyTree: Scalable Skyline Computation for Sensor Data (Jongwuk Lee, Seung-won Hwang)
6. W05: Scalable Clustering and Keyword Suggestion for Online Advertisements (Anton Schwaighofer, Joaquin Quiñonero Candela, Thomas Borchert, Thore Graepel, Ralf Herbrich)
7. Scalable Pseudo-Likelihood Estimation in Hybrid Random Fields (Antonino Freno, Marco Gori, Edmondo Trentin)
8. BBM: Bayesian Browsing Model from Petabyte-scale Data (Chao Liu, Fan Guo, Christos Faloutsos)
9. Large-Scale Graph Mining Using Backbone Refinement Classes (Andreas Maunz, Christoph Helma, Stefan Kramer)
10. Social Influence Analysis in Large-scale Networks (Jie Tang, Jimeng Sun, Chi Wang, Zi Yang)
11. Large-Scale Behavioral Targeting (Ye Chen, John F. Canny, Dmitry Pavlov)
12. Mind the Gaps: Weighting the Unknown in Large-Scale One-Class Collaborative Filtering (Rong Pan, Martin Scholz)
13. Large-Scale Sparse Logistic Regression (Jun Liu, Jieping Ye, Jianhui Chen)