Pervasive DataRush

This blog is syndicated from the Pervasive DataRush site.

July 2010 - Posts

  • Pervasive DataRush: Cost-effective security for companies in a challenging economic climate

    Security spending in a downturn is under tight scrutiny. PricewaterhouseCoopers found this to be the case when it surveyed 7,200 executives in over 130 countries for its 2010 “Trial by Fire” report. One of the report’s primary findings states: 

    Not surprisingly, security spending is under pressure. Most executives are eyeing strategies to cancel, defer or downsize security-related initiatives.
    Source: PricewaterhouseCoopers 2010 “Trial by Fire” report.

    PricewaterhouseCoopers notes that 70% of survey respondents think it is important to consider canceling, deferring or downsizing security-related initiatives if they require capital expenditures, while 71% respond similarly for initiatives requiring working expenditures. At the same time, survey respondents overwhelmingly said they considered security strategies to be important, including increasing the focus on data protection, prioritizing security investments based on risk, reducing or mitigating major risks, and accelerating the adoption of security-related automation technologies to increase efficiency and reduce cost. With many analysts predicting a slow recovery, the pressure to cut or curb security spending could increase. Meanwhile, security intrusions continue, and they are likely growing more sophisticated (consider “Ghostnet”, for example).

    So, are there any cost-effective solutions available to meet growing security needs? Yes. The massively parallel processing horsepower of the Pervasive DataRush™ data processing engine, combined with our matching capabilities, forms a powerful solution for translating massive amounts of raw data into actionable intelligence. Together with Pervasive DataMatcher™, it gives financial, insurance, healthcare, law enforcement and homeland security organizations a cost-effective, robust and innovative way to apply next-generation analytics to detecting fraud and corruption, complying with anti-money laundering controls, and monitoring security and compliance.

    The Proof is in the Pudding: MalStone B-10 Benchmark

    This year Pervasive DataRush was tested internally using the MalStone B benchmark. The benchmark examines large volumes of log files to look for anomalies that might signal intrusions or attempted intrusions.

    The data file for MalStone B is generated by a Python script, and the MalStone records have the format:

    Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

    These records describe a visit by an entity to a site at a particular time. After the visit, the entity sometimes becomes compromised, which is indicated by setting the compromise flag to 1. Each record is 100 bytes. For each site w and each week d, MalStone B computes a ratio: among all visits to site w made in week d or earlier, the percentage for which the visiting entity became compromised at any time between the visit and the end of week d. The MalStone benchmark can use a variable-sized dataset; this experiment used a 10 billion row dataset totaling 1 terabyte.
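
    The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's reference code and not how Pervasive DataRush implements it. It assumes pipe-delimited text records, an ISO-style timestamp, ISO calendar weeks, and that a flag of 1 on a visit record means the entity became compromised within the window (the record format carries no separate compromise time). The function name `malstone_b_ratios` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def malstone_b_ratios(lines):
    """Return {(site_id, (iso_year, iso_week)): percent of flagged visits so far}.

    Assumptions (not from the original post): records are text lines of the
    form 'event|timestamp|site|flag|entity', timestamps parse with
    datetime.fromisoformat, and weeks are ISO calendar weeks.
    """
    # counts[site][week] = [visits in that week, flagged visits in that week]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for line in lines:
        _event, ts, site, flag, _entity = (f.strip() for f in line.split("|"))
        iso = datetime.fromisoformat(ts).isocalendar()
        week = (iso[0], iso[1])
        cell = counts[site][week]
        cell[0] += 1
        cell[1] += int(flag)

    ratios = {}
    for site, weeks in counts.items():
        total = flagged = 0
        # Accumulate week by week so each ratio covers "week d or earlier".
        for week in sorted(weeks):
            total += weeks[week][0]
            flagged += weeks[week][1]
            ratios[(site, week)] = 100.0 * flagged / total
    return ratios


sample = [
    "1|2010-01-04 10:00:00|siteA|0|e1",
    "2|2010-01-05 11:00:00|siteA|1|e2",
    "3|2010-01-12 09:00:00|siteA|0|e3",
    "4|2010-01-13 09:00:00|siteA|0|e4",
    "5|2010-01-06 08:30:00|siteB|0|e1",
]
ratios = malstone_b_ratios(sample)
```

    The per-site running totals are what make the result cumulative: a week's ratio reflects every visit up to the end of that week, not just that week's visits.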

    Summary of Details:

    • 32 core, 4 socket, 2.0 GHz Intel Xeon X7550
    • 1890 seconds (31.5 minutes)
    • 5.29 million rows/sec, approx 509 Mbytes/sec
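
    The row rate follows directly from the dataset size and run time given in this post, and can be checked with a short calculation. The byte rate depends on whether a megabyte is counted as 10^6 or 2^20 bytes, so this sketch reports both conventions rather than assuming which one the approximate figure above uses. The function name `throughput` is hypothetical.

```python
ROWS = 10_000_000_000    # 10 billion records (from the post)
RECORD_BYTES = 100       # each record is 100 bytes (from the post)
ELAPSED_SECONDS = 1890   # reported run time (from the post)


def throughput(rows, record_bytes, seconds):
    """Derive elapsed minutes and row/byte rates from raw run figures."""
    rows_per_sec = rows / seconds
    bytes_per_sec = rows_per_sec * record_bytes
    return {
        "minutes": seconds / 60,
        "million_rows_per_sec": rows_per_sec / 1e6,
        "decimal_mb_per_sec": bytes_per_sec / 1e6,     # 1 MB = 10^6 bytes
        "binary_mib_per_sec": bytes_per_sec / 2**20,   # 1 MiB = 2^20 bytes
    }


stats = throughput(ROWS, RECORD_BYTES, ELAPSED_SECONDS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```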

    The Result: In just 31.5 minutes, 10 billion records were searched for anomalies using Pervasive DataRush. This result is 26x faster than its competition.

    Pervasive DataRush can enable applications that scale seamlessly on a single multicore server (rather than a cluster) to prepare or analyze even massive datasets – at unprecedented speeds.  Cost-effective approaches to daunting security challenges – that’s what IT executives seek in the midst of a lingering downturn, and that’s what we want to give them. 

  • Distributed, Scalable Clustering for Detecting Halos in Terascale Astronomical Datasets.


    The process of stellar discovery has long made its home on High Performance Computing (HPC) systems.
    HPC systems have evolved into clusters of "fat" multicore nodes. Applications must take advantage
    of parallelism across nodes and at the node level to maximize scalability and performance/watt.
    The complexity of multicore programming underscores the need for powerful and efficient runtime
    systems that manage resources such as cores, threads, memory, and communication sub-systems on behalf
    of the application. Dataflow is the computational model used in Pervasive DataRush to construct
    efficient data-parallel pipelines via threads while abstracting away the complexity of multicore
    programming.
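
    The general idea of a threaded dataflow pipeline can be sketched generically. This is only an illustration of the model and does not use or imitate the Pervasive DataRush API. Each stage runs in its own thread and passes items to the next stage through a bounded queue; the names `run_pipeline` and `_start_stage` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def _start_stage(fn, inbox, outbox):
    """Run fn over every item from inbox in its own thread, writing to outbox."""
    def run():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)  # propagate end-of-stream downstream
                return
            outbox.put(fn(item))

    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_pipeline(items, *fns):
    """Push items through the stage functions in order; return the results."""
    # One bounded queue between each pair of neighbours limits memory use.
    queues = [queue.Queue(maxsize=16) for _ in range(len(fns) + 1)]
    stages = [_start_stage(fn, queues[i], queues[i + 1]) for i, fn in enumerate(fns)]

    def feed():
        for item in items:
            queues[0].put(item)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    feeder.join()
    for stage in stages:
        stage.join()
    return results


doubled = run_pipeline(range(5), lambda x: x + 1, lambda x: x * 2)
```

    The application code only supplies the stage functions; thread creation, queueing and shutdown stay inside the runtime helper. In standard CPython builds, threads do not run Python bytecode in parallel, so this shows the structure of the model rather than its performance.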

    Simulation of the dynamical evolution of the entire observable universe via N-Body interactions
    begins with the presumed first principles of the universe: cosmic background radiation; an expanding
    volume of cooling helium and hydrogen; dark matter separating from gas and coalescing into massive
    stars. The classical N-body problem simulates the evolution of a system of N bodies, where the force
    exerted on each body arises due to its interaction with all the other bodies in the system. N-body
    algorithms have numerous applications in areas such as astrophysics, molecular dynamics and plasma
    physics. The CubeP3M method for carrying out large N-body simulations to study the formation and
    evolution of the large-scale structure of the universe combines direct particle-particle forces at
    small scales with particle-mesh forces at larger scales (the particle-particle-particle-mesh
    method). Such an approach produces datasets with 4000^3-5488^3 (64-165 billion) particles. Several
    such simulations were completed on the Ranger cluster (Texas Advanced Computing Center) using
    4,000-22,976 cores.
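
    The direct particle-particle part of an N-body calculation, where every body feels every other body, can be sketched as a double loop. This is a minimal illustration under stated assumptions, not the simulation code described above: it uses Newtonian gravity, arbitrary units with G = 1 by default, and an optional softening length to avoid division by zero. The function name and all of its parameters are hypothetical.

```python
def accelerations(positions, masses, G=1.0, softening=0.0):
    """Direct-summation gravitational acceleration on each body, O(N^2).

    positions: list of (x, y, z) tuples; masses: list of floats.
    G and softening are illustrative parameters, not values from the post.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Vector from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc


two_body = accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
```

    The quadratic cost of this loop is why direct summation is reserved for small scales and a mesh is used for the long-range force.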

    From the astrophysicist’s perspective, the problem of identifying regions of interest in this terascale
    data and being able to visualize these regions is made intractable by the overwhelming volumes of data.
    Current methods to detect individual halos fall into two basic categories: friends-of-friends (FOF)
    and spherical overdensity (SO) methods. The FOF method is particle-based. Dense regions are identified
    by locating particles that are closer to each other than a pre-defined distance, which is a parameter of
    the model and is usually referred to as the 'linking length'. Particles that are within that distance of
    each other are called 'friends', and the halos produced consist of all particles that are connected by
    a chain of friends. The SO class of methods, on the other hand, starts by identifying the local density
    peaks (or gravitational potential minima) as the halo centers and then expand spherical shells around
    those centers until a pre-defined density threshold (a free parameter of the model picked based on
    dynamical considerations) is crossed. Within these types of methods there are multiple variations,
    regarding e.g. how the halo centers are located, how the gravitationally-unbound particles are treated,
    etc. Each of the two basic approaches, FOF and SO, has its advantages and drawbacks and can fail in
    certain situations (Tinker et al.).
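
    The FOF idea described above can be sketched with a brute-force pair search and a union-find structure. This is a minimal illustration only: production halo finders use spatial indexing instead of checking every pair, and usually discard groups below a minimum particle count, which this sketch does not do. The function name `friends_of_friends` is hypothetical.

```python
from itertools import combinations
from math import dist


def friends_of_friends(points, linking_length):
    """Group point indices into halos: points closer than linking_length are
    'friends', and a halo is any set of points connected by a chain of friends."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(N^2) pair check; fine for a sketch, not for terascale data.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) < linking_length:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


particles = [
    (0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),  # a chain of friends
    (5.0, 5.0, 5.0), (5.4, 5.0, 5.0),                   # a separate pair
    (10.0, 10.0, 10.0),                                 # an isolated particle
]
halos = friends_of_friends(particles, linking_length=0.6)
```

    Particles 0 and 2 are farther apart than the linking length but still end up in the same group because particle 1 is a friend of both, which is the "chain of friends" behaviour.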

    "Automated methods for halo identification and visualization are critical to advancing the physical
    understanding of what is happening through better analysis", according to the Astronomy Centre at the
    University of Sussex, UK.

    Our dataflow methods offer an alternative to the current approaches: like SO they are density-based,
    but unlike SO they make no assumptions about halo shapes.

    This dataflow implementation distributes itself across multiple nodes on the Longhorn cluster.
    The parallelized dataflow AutoHDS also makes it possible to use a large number of cores on a
    single inexpensive machine instead of expensive supercomputers. Experiments revealed that when data
    points were uniformly distributed across partitions, dataflow AutoHDS achieved linear speedup
    as the number of machines increased. Dataflow AutoHDS also yields better performance
    with increasing data volumes. In comparisons against Hadoop AutoHDS, dataflow was consistently
    faster on fewer resources.

    This work has been submitted for publication.  Coming here soon....
