Analytics

Microsoft and big data

Not long ago, there were no great options for running Hadoop on Windows. On top of that, Microsoft was exploring a massively parallel platform called Dryad, which was to be its answer to Hadoop. Ultimately the company dropped that plan and committed to Hadoop. Collaborating with Hortonworks, it made Hadoop on Windows a real thing. Now, running Hadoop on a Windows cluster or in the cloud on Azure is a viable option.

HDInsight

HDInsight is the branding for Microsoft’s Hadoop as a service on its Azure cloud platform. Other Hadoop-based projects such as Pig, Hive and Oozie are available as part of HDInsight as well. If you’re more familiar with Amazon AWS than Azure, another way to see HDInsight is that it is similar to Elastic MapReduce. It remains very much Hadoop: it’s still all Java, and MapReduce jobs can still be written in Java. Which…

Analytics

Hive For Un-Structured Data

The Hadoop ecosystem today is very rich and growing. A technology in that ecosystem that I use and enjoy quite a bit is Hive. According to the Hive wiki, Hive is “designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data”. To add to that statement, Hive is also an abstraction built on top of MapReduce that lets you express data processing using a SQL-like syntax described in detail here. Hive reduces the need to deeply understand the MapReduce paradigm and allows developers and analysts to apply their existing knowledge of SQL to big data processing. It also makes expressing MapReduce jobs more declarative. One thing I hear a lot is that Hive, being schema-driven and having typed columns, is only fit for processing structured, row-oriented tabular data. Although this seems like a logical conclusion, it is very…