Tuning the Approach

In a previous post, I said I’d talk about how to tune your dashboard without doing a bunch of analysis the hard way.

As with the dashboard, it may not help you to know what works for my business, but there are some common themes, and I’ll dig into those here.

Know Your Goal

This is the hard part: what do you need to do? You can’t tune toward anything until you have a goal in mind.

You need to clearly articulate your goal and the data set where you think the key information is. I like to do this Mad-Libs style:

I need [what you need to see]
so that I can [what you will learn when you see it].
I’ll know it’s working when [the goal you want to achieve].

Here are some examples:

I need student data with roster, grade, continuation, and graduation rates so that I can identify early math instructors with the greatest influence over retention. I’ll know it’s working when I can show that retention improves when assignments use this information vs. naive roster assignment.

Or:

I need order data with customer, date, and SKUs so that I can identify relationships between items in semantically distant categories for our recommendation feature. I’ll know it’s working when out-of-category click-throughs for the recommendation panel move by 1.5% or more.
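
To make that second example concrete, here’s a minimal Python sketch of the co-occurrence counting it implies. The file name and columns (an orders.csv with order_id, sku, and category) are hypothetical stand-ins for whatever your order export looks like:

    import csv
    from collections import defaultdict
    from itertools import combinations

    # Hypothetical input: one row per line item, with order_id, sku, category.
    orders = defaultdict(set)  # order_id -> set of (sku, category) pairs
    with open("orders.csv") as f:
        for row in csv.DictReader(f):
            orders[row["order_id"]].add((row["sku"], row["category"]))

    # Count how often each cross-category pair of SKUs lands in the same order.
    pair_counts = defaultdict(int)
    for items in orders.values():
        for (sku_a, cat_a), (sku_b, cat_b) in combinations(sorted(items), 2):
            if cat_a != cat_b:  # keep only pairs from different categories
                pair_counts[(sku_a, sku_b)] += 1

    # The most frequent cross-category pairs are recommendation candidates.
    for pair, count in sorted(pair_counts.items(), key=lambda kv: -kv[1])[:20]:
        print(pair, count)

Raw counts are the crudest possible signal; a lift or chi-squared score would correct for items that are simply popular everywhere, but the shape of the job is the same.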

Use the Right Tools

The variety of places you might have useful data can be intimidating. You don’t need to be afraid of this. Just because you can’t predict what will be useful doesn’t mean your data are useless. You can use techniques like clustering to turn big groups of data into small groups or even single points. You can use supervised or unsupervised machine learning techniques to let your system tell you where to look.
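
Here’s one concrete version of that idea, a sketch using scikit-learn’s KMeans; the customer features are invented for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical data: 10,000 customers, each described by a few numeric
    # features (say, order count, average order value, days since last order).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))

    # Collapse 10,000 points into 8 groups; each centroid summarizes a group.
    model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
    print(model.cluster_centers_)      # 8 rows instead of 10,000
    print(np.bincount(model.labels_))  # how many customers land in each group

Eight centroids are something a human can actually look at and name, which is the point of the reduction.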

You can use tools developed for companies with petabyte-scale data. They work in any company. They’re free and they’re good at this kind of stuff. Let them do the heavy lifting so you can focus on outcomes. Not everything starts with an elaborate theory and proceeds through a carefully constructed experiment.
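
Spark is one such tool. As a sketch, reusing the hypothetical orders.csv from above (here with order_id, category, and line_total columns), a distributed aggregation runs the same way on a laptop and on a cluster:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("order-rollup").getOrCreate()
    orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

    # Distributed group-by: order volume and revenue per category. Spark
    # plans and parallelizes the work; we only state the result we want.
    (orders.groupBy("category")
           .agg(F.countDistinct("order_id").alias("orders"),
                F.sum("line_total").alias("revenue"))
           .orderBy(F.desc("revenue"))
           .show())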

That being said, you’re probably going to get the best results with a theory and a carefully constructed experiment.

Iterate

The topic of experiment design is too big to cover here, but the takeaway for Big Data in business is that you’re never done. You set out to answer a question and it generates more questions. You get even more data, run more preprocessing, answer deeper questions, get more answers, and… ask more questions. But since you’re clear on your goal and how to measure success, things improve over time. That’s not a defect; it’s how you know it’s working.
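
As a small example of “how to measure success,” here’s a hand-rolled two-proportion z-test against the 1.5% click-through goal from earlier. The counts are invented:

    from math import sqrt, erf

    def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
        # Two-sided z-test for a difference between two click-through rates.
        p_a, p_b = clicks_a / views_a, clicks_b / views_b
        pooled = (clicks_a + clicks_b) / (views_a + views_b)
        se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tails
        return p_b - p_a, p_value

    # Hypothetical counts: control panel vs. the new recommendation panel.
    lift, p = two_proportion_z(420, 20_000, 780, 20_000)
    print(f"lift = {lift:.2%}, p = {p:.4f}")

If the lift clears 1.5% and the p-value is small, you’ve hit the goal you wrote down; if not, you have your next question.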


The World of Big Data – Perspectives and Predictions for 2014

2013 saw a lot of action in the big data space, both from a technology and application perspective. Almost everyone we talk to has a big data project underway or coming soon. The projects range from cross-system reports to adopting Hadoop and building applications on top of it. While there are a lot of challenges still ahead, it is very promising to see how much momentum big data initiatives are gaining.

At the same time there is still a certain lack of knowledge or trust when it comes to this space. Some of the questions I hear are: “Do we have big data? Don’t only Facebook and Google have big data? Why should I use Hadoop or NoSQL?”

While I certainly agree that not everybody on the planet works at petabyte scale, these questions can miss the point of the big data revolution.

Let me use an example to explain why. In the software development space a few years ago, there was a huge wave of interest in Ruby and Ruby on Rails. The theme seemed to be that in the future, everyone would build everything in Ruby. Clearly that hasn’t happened, but that doesn’t mean it was a failure. Ruby was successful not for its conquest of the language landscape, but as a mindset. That mindset of easy-to-learn, easy-to-use productivity led to more folks being polyglots. It drove the emergence of tools that focus on developer productivity as a key goal and made people expect their language to make them productive.

In the same way, “Big Data” no longer means “more data than anyone has ever dealt with before.” It is a new way of approaching problems. It’s a mindset that looks beyond a one-size-fits-all relational database to a world where structured, unstructured, semi-structured, user-generated, and system-generated data can and should all work together.

I thought that a good way to kick off 2014 would be to make some predictions for the big data ecosystem for the upcoming new year.

#1 Better Real-Time Querying on Hadoop

Real-time querying (RTQ) has been a shortcoming of the traditionally batch-oriented Hadoop platform. Technologies like Hive, Pig, Lingual, and a handful of commercial products were available, but all of them were still deeply rooted in the world of MapReduce, and the results of an ad hoc analysis could take hours.

This drove a need for technologies that let analysts interact with the rich data that organizations are aggregating in Hadoop. Cloudera has been the leader in this space with the development of Impala, but Hortonworks has the Stinger initiative, MapR is backing Apache Drill, and there are others. In addition, we’re seeing developments like Twitter open-sourcing Storm, which reflects expanded interest not only in real-time querying by analysts, but in real-time processing by applications.

2014 will be the year when RTQ becomes mainstream and robust enough for general adoption.

#2 Operationalization of Big Data Efforts

While a lot of organizations have dabbled in solutions based on Hadoop, we have seen more proof-of-concept projects than real-world solutions. Finding people who have the technical know-how to make these projects work will remain challenging, but 2014 will be the tipping point for enterprises to start integrating their big data initiatives into the core of their business.

#3 Increasing Convergence Between Cloud Computing and Big Data Solutions

It is abundantly clear that the ideas of virtualization, elasticity, and automation (the words we associate with “cloud”) are here to stay and will quickly move into the big data space. Amazon’s Elastic MapReduce is a perfect example of a tool designed to put big data in the cloud, and the lessons and capabilities learned there will be useful whether the architecture stays in someone else’s leased cloud or migrates to in-house virtual environments.
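
To make the elasticity point concrete, here’s a sketch using boto3, the AWS SDK for Python; the bucket, script path, release label, and instance choices are all placeholders. It rents an EMR cluster, runs one job, and lets the cluster terminate itself:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Spin up a throwaway cluster, run one step, then release the hardware.
    response = emr.run_job_flow(
        Name="nightly-order-rollup",
        ReleaseLabel="emr-6.15.0",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Elasticity in one flag: pay only while the job is running.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "rollup",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/rollup.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])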


For all of these predictions, as with Ruby, the specific tool isn’t the important part: you’re going to see the tools used for this work get expanded, refined, and maybe replaced altogether. Some of those changes will stick, and others will turn out to be fads. Successful organizations will win by staying clearly focused on what they want their data to do for them, then learning and adopting the best tools for the job.

Ultimately, the future belongs to those who are able to:

  • Capture and unite various points of data from inside and outside the organization
  • Marry the data with analytics that can guide the strategy of the organization

In future blog posts we will go over our technology choices and the rationale behind them. We will also discuss our experiences with real-time querying on Hadoop, data pipelines, cloud computing, and the other trends that are shaping our industry.

Feel free to reach out to me at shekhar@bluecanarydata.com if you are interested in talking about some problems you have or if you just want to bounce some ideas off us.

Happy Holidays and a Happy New Year!!