Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

December 2008 - Posts

  • Pervasive Parallelism in Data Mining: Co-clustering of Netflix Data

    authored by:Srivatsava Daruru(1), Matt Walker(2), Nena Marín(2), Joydeep Ghosh(1)

    Abstract
    Building realistic models requires computationally expensive algorithms. Mining overwhelming volumes of historical data emphasizes the need for efficient implementations. The data versus algorithm trade off exists only if resources like memory, storage and processors are limited and/or algorithm implementation does not capitalize on these hardware resources.

    This research focuses on introducing a terascale “Pervasive Parallelism” Java framework called Pervasive DataRush™, capable of delivering sustained high performance across large scale Netflix datasets for recommender systems on commodity multicore; ours tested on up to 32 x 64-bit cores.

    Pervasive DataRush™ uses dataflow process networks to operate on very large data sets. The Pervasive DataRush Library is an extensive and customizable Java library of massively parallel programming components using dataflow principles. Components are even "self-composing", with late-binding facilities to dynamically adjust parallel execution strategies.

    In this paper, a Minimum Sum-Squared Residue Coclustering algorithm is applied to the Netflix Competition Dataset. Coclustering simultaneously clusters rows and columns to find natural patterns of customers and movies based on the customer movie ratings. Results demonstrate linear scaling across data and cores. Typical runtimes drop from 226 sec to 9 sec (Netflix 5%), demonstrating sustained performance and scaling from 1 to 32 cores. Additionally results show that this massively parallel two-dimensional k-means algorithm operates in a dataflow graph.

More Posts