Attending the O'Reilly Strata Conference, I received lots of food for thought about the future of Big Data, as well as further validation that Pervasive DataRush™ is a strong framework for responding to many of the information-explosion challenges organizations face now or will face soon. Here are some of my takeaways from this insightful event.
Manager, Product Marketing
Information is Black Gold
Metamarkets CTO Mike Driscoll told technology executives to think 'oil' when it comes to information. Quoting Gartner's Peter Sondergaard, Driscoll stresses, "Information will be the 'oil of the 21st century.' It will be the resource running our economy in ways not possible in the past." Driscoll describes Big Data as the 'tar sands' of the information economy: valuable stores of information that are expensive to extract. Once extracted, the challenge is to analyze the data, using it to learn and predict.
Driscoll sees three major forces driving Big Data:
- Ubiquitous sensor networks (mobile phones, as an example).
- Cloud computing obviating the need to manage compute power. Drawing an analogy to an electric grid, Driscoll says that businesses don't invest capital in power generation, and the cloud enables a similar trend for compute power.
- Machine learning, with Driscoll citing the progress made in the DARPA Grand Challenge and the Netflix Prize.
Other emerging trends Driscoll is tracking:
- The Need for Data Scientists: Already in short supply, data scientists are in growing demand. Companies are looking for those with interdisciplinary skills in math, statistics, bioinformatics, physics, and programming (and hacking), and, above all, curiosity. In fact, many speakers ended their presentations with a message to data scientists: We're hiring.
- The Rise of Data Publishers (i.e., the reassertion of control by data producers): Companies recognize the value of their own data and are pulling back from third-party data processors.
- The End of Privacy (or the Rethinking of Privacy): The view that visibility of personal data can be restricted is giving way to one focused on restricting how that data may be used. In other words, policing usage will become more prevalent.
- The Rise of Data Start-ups: A class of companies is emerging whose supply chains consist of nothing but data. Their inputs are collected through partnerships or from publicly available sources, processed, and transformed into traffic predictions, news aggregations, or real estate valuations. Data start-ups are the wildcatters of the information age, searching for opportunities across the data landscape.
Data science, Driscoll firmly believes, can solve big problems for organizations, namely making sense of the world and scaling up decision making. As a case in point, he cites the use of data mining to reduce health care costs by identifying the neediest patients and improving their care.
Read more of Driscoll's commentary.
Traditional BI and Applications are Complementary
Dr. Barry Devlin of 9sight Consulting, a founder of the data warehousing industry with over 30 years in DW and BI, suggests that Traditional BI (with its database-centric approach and emphasis on consistency, traceability, and data quality) and Applications (built on technologies like Hadoop and MapReduce, with support for large, rapidly changing datasets) are complementary approaches to handling the information explosion organizations face.
I found it interesting that Dr. Devlin presented Traditional BI and Applications as two worlds that need to, and can, work together. Our product Pervasive DataRush works in both spaces: on the traditional BI side, it's the basis for our Pervasive DataMatcher and Pervasive Data Profiler products; on the applications side, it integrates with Hadoop, accelerating MapReduce jobs by up to 10x.
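For readers less familiar with the MapReduce model Dr. Devlin contrasts with traditional BI, here is a minimal in-memory sketch of its two phases, a map step that emits (word, 1) pairs and a reduce step that sums counts per key. This is generic illustrative Java, not Hadoop or Pervasive DataRush code; the class and method names are my own:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Map phase: split each input line into words (the "emit (word, 1)" step).
    // Reduce phase: group by word and sum the counts per key.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("big data", "big analytics")));
    }
}
```

Real MapReduce frameworks distribute exactly this pattern across a cluster, which is what makes them a fit for the large, rapidly changing datasets Dr. Devlin describes.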
Take a look at Dr. Devlin's slide presentation.
The Multicore Crisis and Emerging Technologies
One of the highlights of Strata was listening to Third Nature President Mark Madsen's survey of new technologies, particularly the innovations and systems powering the analytic database landscape today. Madsen underscored the multicore crisis and the end of the Moore's Law "free lunch" as major factors shaping data technology.
Pre-2005, Madsen says, the trend was for CPU manufacturers to increase clock rates with every new chip, and everyone's software would automatically run faster. But increasing clock speed also increases power consumption and heat. As a remedy, CPU makers moved toward putting multiple cores on a chip. Madsen, however, points out, "Putting more engines in your car doesn't make it go faster; you need to redesign to take advantage of them. Achieving multicore performance is fundamentally different than getting a free boost from clock rates increasing." I couldn't agree more!
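Madsen's point can be made concrete in a few lines of Java. A sequential loop uses one core no matter how many the machine has; only when the work is restructured into independent chunks can the runtime spread it across cores. This is a generic sketch of the idea, not Pervasive DataRush code:

```java
import java.util.stream.LongStream;

public class MulticoreDemo {
    // Sequential sum: one core does all the work, regardless of core count.
    static long sequentialSum(long n) {
        long sum = 0;
        for (long i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    // Parallel sum: the same computation, redesigned as independent range
    // chunks that the fork/join runtime can distribute across cores.
    static long parallelSum(long n) {
        return LongStream.rangeClosed(1, n).parallel().sum();
    }

    public static void main(String[] args) {
        long n = 10_000_000L;
        // Both versions compute the same answer; only the parallel one
        // can benefit from additional cores.
        System.out.println(sequentialSum(n) == parallelSum(n));
    }
}
```

The redesign here is trivial because summation decomposes cleanly; most real analytic workloads require the kind of deliberate dataflow restructuring that frameworks like Pervasive DataRush exist to handle.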
Companies operating at petabyte scale like Google, eBay, and Twitter are the exception, not the norm, Madsen states. Most companies in need of Big Data analytics have less than six terabytes of data, and he finds that the computational demands of data analytics are pushing companies from PCs onto SMP servers and clusters. I would add that Pervasive DataRush can help here.
Mark was throwing out insights faster than I could write them down... thankfully, his slides are available. I recommend taking a more in-depth look. I know I will.