Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomical Datasets.


The process of stellar discovery has long made its home on High Performance Computing (HPC) systems.
HPC systems have evolved into clusters of "fat" multicore nodes. Applications must take advantage
of parallelism across nodes and at the node level to maximize scalability and performance/watt.
The complexity of multicore programming underscores the need for powerful and efficient runtime
systems that manage resources such as cores, threads, memory, and communication sub-systems on behalf
of the application. Dataflow is the computational model that Pervasive DataRush uses to construct
efficient data-parallel pipelines from threads while abstracting away the complexity of multicore programming.
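
As a rough illustration of the dataflow idea (operators connected by queues, each running on its own thread), here is a minimal Python sketch. It is not the Pervasive DataRush API: the names `stage` and `run_pipeline` and the `capacity` parameter are invented for this example.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item read from inbox and write results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # pass end-of-stream to the next operator
            return
        outbox.put(func(item))


def run_pipeline(items, funcs, capacity=4):
    """Run one thread per operator, linked by bounded queues (hypothetical helper)."""
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]

    def feed():
        # A separate feeder thread avoids blocking while results are collected.
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    for t in workers + [feeder]:
        t.start()

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)

    for t in workers + [feeder]:
        t.join()
    return results


squares_of_successors = run_pipeline(range(10), [lambda x: x + 1, lambda x: x * x])
```

The bounded queues give back-pressure: a fast operator blocks rather than flooding memory when a downstream operator is slow. In CPython the threads mostly illustrate the structure, since the interpreter lock limits CPU parallelism for pure-Python operators.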

Simulation of the dynamical evolution of the entire observable universe via N-Body interactions
begins with the presumed first principles of the universe: cosmic background radiation; an expanding
volume of cooling helium and hydrogen; dark matter separating from gas and coalescing into massive
stars. The classical N-body problem simulates the evolution of a system of N bodies, where the force
exerted on each body arises due to its interaction with all the other bodies in the system. N-body
algorithms have numerous applications in areas such as astrophysics, molecular dynamics and plasma
physics. The CubeP3M method for carrying out large N-body simulations of the formation and evolution
of large-scale structure in the universe combines direct particle-particle forces at small scales
with particle-mesh forces at larger scales (the Particle-Particle-Particle-Mesh method). Such an approach
produces datasets with 4000^3 to 5488^3 (64 to 165 billion) particles. Several such simulations were
completed on the Ranger cluster (Texas Advanced Computing Center) using 4,000 to 22,976 cores.
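
To make the "every body interacts with every other body" statement concrete, here is a minimal direct-summation sketch in Python. It covers only the particle-particle part, is not taken from the simulation code described above, and the function name, the `G=1.0` unit choice, and the `softening` parameter are assumptions for illustration.

```python
import math


def direct_accelerations(positions, masses, G=1.0, softening=0.0):
    """Gravitational acceleration on each body from all others, O(N^2) pairs."""
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            # Separation vector pointing from body i to body j
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc


# Two unit masses separated by a distance of 2 along the x axis
two_body = direct_accelerations([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0])
```

The double loop is the reason pure direct summation does not scale: with N = 4000^3 = 6.4 x 10^10 particles, one force evaluation would need on the order of N^2, roughly 4 x 10^21, pair interactions. Handing the long-range part to a mesh is what makes simulations of this size feasible.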

From the astrophysicist’s perspective, the problem of identifying regions of interest in this terascale
data and being able to visualize these regions is made intractable by the overwhelming volumes of data.
Current methods to detect individual halos fall into two basic categories, namely friends-of-friends (FOF)
and spherical overdensity (SO) methods. The FOF method is particle-based. Dense regions are identified
by locating particles that are closer to each other than a pre-defined distance, which is a parameter of
the model and is usually referred to as the 'linking length'. Particles that are within that distance of
each other are called 'friends', and the halos produced consist of all particles which are connected by
a chain of friends. The SO class of methods, on the other hand, starts by identifying the local density
peaks (or gravitational potential minima) as the halo centers and then expand spherical shells around
those centers until a pre-defined density threshold (a free parameter of the model picked based on
dynamical considerations) is crossed. Within these types of methods there are multiple variations,
regarding e.g. how the halo centers are located, how the gravitationally-unbound particles are treated,
etc. Each of the two basic approaches, FOF and SO, has its advantages and drawbacks and can fail in
certain situations (Tinker et al.).
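
The two approaches can be sketched in a few lines of Python. This is a toy version under stated assumptions, not any production halo finder: the FOF routine uses a brute-force O(N^2) pair search with union-find (real codes use spatial trees or grids), the SO routine takes the halo center as given, and all function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def friends_of_friends(points, linking_length):
    """Group point indices into halos connected by chains of 'friends'."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < linking_length:
                parent[find(i)] = find(j)  # merge the two groups

    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return sorted(groups.values())


def spherical_overdensity_radius(points, center, mean_density, threshold, mass=1.0):
    """Grow a sphere around center until the enclosed mean density falls
    below threshold * mean_density; return the last radius above it."""
    radii = sorted(math.dist(p, center) for p in points)
    radius = 0.0
    for count, r in enumerate(radii, start=1):
        if r == 0.0:
            continue  # particle exactly at the center: sphere has no volume yet
        density = count * mass / (4.0 / 3.0 * math.pi * r ** 3)
        if density < threshold * mean_density:
            break
        radius = r
    return radius


line = [(0, 0, 0), (0.5, 0, 0), (1.0, 0, 0), (5, 0, 0), (5.4, 0, 0)]
halos = friends_of_friends(line, linking_length=0.6)

cloud = [(0.1, 0, 0), (0, 0.1, 0), (0, 0, 0.1), (10, 0, 0)]
r_so = spherical_overdensity_radius(cloud, (0, 0, 0), mean_density=1.0, threshold=200.0)
```

In the FOF example, particles 0 and 2 are farther apart than the linking length but still end up in the same halo because particle 1 is a friend of both. That chaining is what lets FOF follow arbitrary shapes, and also what lets it bridge two distinct halos through a thin filament. The SO routine, by construction, always returns a sphere.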

"Automated methods for halo identification and visualization are critical to advancing the physical
understanding of what is happening through better analysis", said Astronomy Centre at the
University of Sussex, UK.

Our dataflow method offers an alternative to the current approaches: it is density-based, like SO,
but unlike SO it makes no assumptions about halo shapes.

This dataflow implementation distributes itself across multiple nodes on the Longhorn cluster.
Likewise, this parallelized dataflow AutoHDS (Automated Hierarchical Density Shaving) can use a
large number of cores on a single inexpensive machine instead of an expensive supercomputer.
Experiments revealed that when data points were uniformly distributed across partitions, dataflow
AutoHDS achieved linear speedup as the number of machines increased. Dataflow AutoHDS also yields
better performance with increasing data volumes. In comparisons against Hadoop AutoHDS, dataflow
was consistently faster on fewer resources.
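
One way to see why uniform partitions matter for linear speedup is a simple cost model. The Python sketch below is an assumption-laden toy, not a measurement from these experiments: it assumes one worker per partition, cost proportional to the number of points, and no communication overhead. The function name `ideal_speedup` is hypothetical.

```python
def ideal_speedup(partition_sizes):
    """Best-case speedup over a single worker when each partition runs on
    its own worker and the job finishes when the largest partition does."""
    total = sum(partition_sizes)
    slowest = max(partition_sizes)
    return total / slowest


balanced = ideal_speedup([250, 250, 250, 250])
skewed = ideal_speedup([700, 100, 100, 100])
```

Under this model, four equal partitions give a speedup of 4, matching the number of workers, while a skewed split of the same 1,000 points is held back by its largest partition and gives well under 2.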

This work has been submitted for publication.  Coming here soon....


About n5712036

Dr. Nena M. Marín joined Pervasive Software Innovations Laboratories (iLabs) in September of 2008. Her research efforts in iLabs focus on parallel data mining algorithms and their applications in business and science. Dr. Marín’s research interests include data-intensive high performance computing, mathematical modeling and simulations of physiological systems, spectral pattern recognition for disease detection and drug delivery, bioinformatics, and Monte Carlo simulations in tissue photonics. Her most recent industry research interests include patterns in large-scale and sparse datasets, clustering and unsupervised learning, collaborative filtering recommender systems, and marketing and sales optimization churn analysis. Dr. Marín’s most recent work, entitled “Pervasive Parallelism in Data Mining: Dataflow Solution to Co-clustering Large and Sparse Netflix Data”, has been selected for presentation at the Knowledge Discovery and Data Mining (KDD) Conference, July 2009, in Paris, France. She leads collaborations with academic partners focusing on bringing the power of commodity multicore and parallel architectures into the hands of researchers to accelerate delivery of science. Dr. Marín is a National Science Foundation Fellow. After earning both a Bachelor of Science degree in Mechanical Engineering in 1984 and a Master's degree in Mechanical Engineering in 1995 at the University of Texas at Austin, Dr. Marín received her Ph.D. in Biomedical Engineering from the University of Texas at Austin in 2005. Her Ph.D. research was funded by the National Institutes of Health Program and focused on pattern recognition and automated data mining algorithms for cervical cancer detection. Dr. Marín worked as part of a multidisciplinary team in a Phase II clinical trial conducted at M.D. Anderson Cancer Center and the British Columbia Cancer Center in Vancouver, Canada.