Analytics

Work Flow Management for Big Data: Guide to Airflow (part 1)

Data analytics has been playing a key role in the decision making process at various stages of the business in many industries. In this era of Big Data, the adoption level is going to ONLY increase day by day. It is really overwhelming to see all the Big Data technologies that are popping up every week to cater to various stages of the Big Data Solution implementations. With data being generated at a very fast pace at various sources (applications that automate the business processes), implementing solutions for use cases like “real time data ingestion from various sources”, “processing the data at different levels of the data ingestion” and “preparing the final data for analysis” become challenging. Especially, orchestrating, scheduling, managing and monitoring the pipelines is a very critical task for the Data platform to be stable and reliable. Also, due to the dynamic nature of the data sources, data inflow rate,…

Analytics

How To Rebuild Cloudera’s Spark

As a followup to the post How to upgrade Spark on CDH5.5, I will show you how to get a build environment up and running with a CentOS 7 virtual machine running via Vagrant and Virtual Box. This will allow for the quick build or rebuild of Cloudera’s version of Apache Spark from https://github.com/cloudera/spark. Why? You may want to rebuild Cloudera’s Spark in the event that you want to add functionality that was not compiled in by default. The Thriftserver and SparkR are two things that Cloudera does not ship (nor support), so if you are looking for these things, these instructions will help. Using a disposable virtual machine will allow for a repeatable build and will keep your workstation computing environment clean of all the bits that may get installed. Requirements You will need Internet access during the installation and compilation of the Spark software. Make sure that you…