2013 saw a lot of action in the big data space, from both a technology and an application perspective. Almost everyone we talk to has a big data project underway or coming soon. The projects range from cross-system reports to adopting Hadoop and building applications on top of it. While there are a lot of challenges still ahead, it is very promising to see how much momentum big data initiatives are gaining.
At the same time there is still a certain lack of knowledge or trust when it comes to this space. Some of the questions I hear are: “Do we have big data? Don’t only Facebook and Google have big data? Why should I use Hadoop or NoSQL?”
While I certainly agree that not everybody on the planet works at petabyte scale, these questions can miss the point behind the big data revolution.
Let me use an example to explain why. In the software development space a few years ago, there was a huge wave of interest in Ruby and Ruby on Rails. The theme seemed to be that in the future, everyone would build everything in Ruby. Clearly that hasn’t happened, but that doesn’t mean it was a failure. Ruby was successful not for its conquest of the language landscape, but as a mindset. That mindset of easy-to-learn, easy-to-use productivity led to more folks being polyglots. It drove the emergence of tools that focus on developer productivity as a key goal and made people expect their language to make them productive.
In the same way, “Big Data” no longer means “more data than anyone has ever dealt with before.” It is a new way of approaching problems. It’s a mindset that looks beyond a one-size-fits-all relational database to a world where structured, unstructured, semi-structured, user generated, and system generated data can and should all work together.
I thought that a good way to kick off 2014 would be to make some predictions for the big data ecosystem in the coming year.
#1 Better Real Time Querying on Hadoop
Real-time querying (RTQ) has been a shortcoming of the traditionally batch-oriented Hadoop platform. Technologies like Hive, Pig, Lingual, and a handful of commercial products were available, but all of them were still deeply rooted in the world of MapReduce, and results of ad hoc analysis could take hours.
This drove a need for technologies that let analysts interact with the rich data that organizations are aggregating in Hadoop. Cloudera was the leader in this space with the development of Impala, but Hortonworks has Stinger, MapR is backing Apache Drill, and there are others. In addition, we’re seeing developments like Twitter releasing Storm. This reflects expanded interest in not only real-time querying by analysts, but also real-time processing by applications.
2014 will be the year when RTQ becomes mainstream and robust enough for general adoption.
#2 Operationalization of Big Data Efforts
While a lot of organizations have dabbled in solutions based on Hadoop, we have seen more proof-of-concept projects than real-world solutions. Finding people who have the technical know-how to make these projects work will remain challenging, but 2014 will be the tipping point at which enterprises start integrating their big data initiatives into the core of their business.
#3 Increasing Convergence and Alliance Between Cloud Computing and Big Data Solutions
It is abundantly clear that virtualization, elasticity, and automation, the ideas that we associate with “cloud,” are here to stay and will quickly move into the big data space. Amazon’s Elastic MapReduce is a perfect example of a tool designed to put big data in the cloud, and the lessons learned and capabilities developed there will be useful whether the architecture stays in someone else’s leased cloud or migrates to in-house virtual environments.
For all of these predictions, as with Ruby, the implementation tool isn’t really the important part here: You’re going to see the tools used to do this get expanded, refined, and maybe replaced altogether. Some of those changes will stick, and others will just be fads. Successful organizations will win by having a clear focus on what they want their data to do for them, then learning and adopting the best tools for their job.
Ultimately, the future belongs to those who are able to:
- Capture and unite various points of data from inside and outside the organization
- Marry the data with analytics that can guide the strategy of the organization
In future blog posts we will go over our technology choices and the rationale behind them. We will also discuss our experiences with real-time querying in Hadoop, data pipelines, cloud computing, and the other trends that are shaping our industry.
Feel free to reach out to me at shekhar@bluecanarydata.com if you are interested in talking about some problems you have or if you just want to bounce some ideas off us.
Happy Holidays and a Happy New Year!!