Analytics

Microsoft and big data

Not that long ago, there were no great options for running Hadoop on Windows. On top of that, Microsoft was exploring a massively parallel platform they called Dryad, which was to be their answer to Hadoop. Ultimately they dropped that plan and committed to Hadoop. Collaborating with Hortonworks, they made Hadoop on Windows a real thing. Now, running Hadoop on a Windows cluster or in the Azure cloud is a viable option. HDInsight is the branding for Microsoft’s Hadoop as a service on its Azure cloud platform. Other Hadoop-based projects such as Pig, Hive and Oozie are available as part of HDInsight as well. If you’re more familiar with Amazon AWS than Azure, another way to see HDInsight is that it is similar to Elastic MapReduce. It remains very much Hadoop: it’s still all Java, and MapReduce jobs can still be written in Java. Which…

Cloud Computing

Developing a Bot Army

This may be slightly different territory for this blog, since it’s not directly related to data analytics. But it still inhabits the same universe, and I thought it was interesting enough to share. Over the last few months I had an opportunity to work on a project that took a unique approach to data acquisition. The data we were interested in came from various sources, but it was not always available from feeds, dumps or even APIs. A good portion of it was only accessible by visiting a website, submitting a search form and then drilling down through the results. Since we were going after a lot of data, this would have to be automated in some fashion. Basically, what we needed to set up was a scalable screen-scraping operation. Luckily, there is a fair amount of technology available to accomplish just this. The…