#itsnotbigdata

I’m taking a big step with my social networking persona…I’m starting a hashtag.  Do I have to register it with ICANN?  Biz Stone?  Jimmy Fallon & Justin Timberlake?  The new hashtag is #itsnotbigdata.

My reasoning is this: big data (or Big Data or “Big Data”) is at the peak of inflated expectations on the Gartner Hype Cycle.  That means that every blogger and her brother is using the term so that it’ll garner more hits on the interwebs.  Problem is, it’s not always used accurately and consistently.  Now I like data…I like technology…I like software…but I don’t like when buzzwordy terms get thrown around haphazardly with no regard for the downstream effects.  And what are the downstream effects?  It’s article after article incorrectly using the term Big Data, thereby propagating the misuse for future researchers.  It’s citogenesis all over again!

I don’t have my own pet definition of Big Data.  I agree that it means you can’t view the data in Excel or Notepad.  I like that Big Data implies you’ve created some new tool/application/technique to analyze data (not just using Oracle or MS SQL Server).  I’m lukewarm on the “3 V’s” definition.  Regardless, it’s pretty clear what ISN’T Big Data.  A database with 15 million records in it is not Big Data.  A log file from some website’s Apache server isn’t Big Data.  A predictive model that uses 100,000 customers for training data isn’t Big Data.  These are all problems that data folks have solved time and time again over the last couple of decades.  It’s called database management or analytics or even simply “reporting”.  Just because Big Data is a hot term, it doesn’t mean that everything you do with data is now Big Data.

I’m a huge fan of the maxim “right tool for the right job”.  It implies that you are not solving every problem with the same technique every time (“if all you have is a hammer, then everything becomes a nail”).  It also implies that you have put some thought into analyzing the problem at hand.  At Blue Canary, we tend to approach our data problems that way.  Do you know what the primary data store is for our higher education retention analytics product?  CSV files.  That’s right…plain old flat comma-separated files.  Why?  Because we don’t need the layers that come along with database management.  We use Java programs and Python libraries to move/analyze data.  We just need to pass the data along from process to process.
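For the curious, here is a rough sketch of what “passing the data along from process to process” can look like with nothing but flat CSV and Python’s standard library.  To be clear, this is not our actual code.  The column names (`student_id`, `logins`), the `low_engagement` flag, and the threshold are all made up for illustration, and the two stages run in memory here where a real pipeline would more likely hand off files on disk between separate programs.

```python
import csv
import io


def add_engagement_flag(src, dst, threshold=3):
    """Stage 1: copy every row from src to dst, adding a 'low_engagement' column.

    The 'logins' column and the threshold are hypothetical examples.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + ["low_engagement"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["low_engagement"] = "1" if int(row["logins"]) < threshold else "0"
        writer.writerow(row)


def keep_flagged(src, dst):
    """Stage 2: keep only the rows that stage 1 flagged."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["low_engagement"] == "1":
            writer.writerow(row)


def run_pipeline(raw_csv_text):
    """Chain the two stages; the only contract between them is the CSV header."""
    stage1_out = io.StringIO()
    add_engagement_flag(io.StringIO(raw_csv_text), stage1_out)
    stage1_out.seek(0)  # rewind so stage 2 reads from the start
    stage2_out = io.StringIO()
    keep_flagged(stage1_out, stage2_out)
    return stage2_out.getvalue()


if __name__ == "__main__":
    print(run_pipeline("student_id,logins\ns1,5\ns2,1\n"))
```

Each stage only needs to agree with its neighbor on the header row, which is about as thin a layer as you can get.  Swap the in-memory buffers for open file handles and the same functions work on files.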

Now, if the data started to get so broad/deep that flat files didn’t cut it, we’d use MySQL or some other data management tool to solve that problem.  If it got even bigger, we’d use MapReduce/Hadoop to solve it.  We’ve used these techniques both appropriately and inappropriately and we’ve adjusted our approach.

Two final thoughts on Big Data.  First, I mean no disrespect to any articles I reference with my new hashtag.  I understand how business works and I appreciate that folks are taking the time to publicize the innovative uses of data.  Second, I give a tremendous amount of credit to the true Big Data pioneers who continually push the envelope, develop new tools, and share their findings with the rest of us.  Thanks for your innovation and dedication to the art.  #itsnotbigdata, but sometimes it is #bigdata.