
Workflow Management for Big Data: Guide to Airflow (Part 3)

Data analytics plays a key role in decision making at various stages of the business in many industries, and in this era of Big Data its adoption will only keep increasing. The number of Big Data technologies that appear every week to serve the various stages of a Big Data solution can be overwhelming. With data being generated at a very fast pace from many sources (the applications that automate the business processes), implementing use cases such as real-time data ingestion from multiple sources, processing the data at different stages of ingestion, and preparing the final data for analysis becomes challenging. In particular, orchestrating, scheduling, managing and monitoring the pipelines is critical for the data platform to remain stable and reliable. The dynamic nature of the data sources, data inflow rates, data schemas and processing needs makes workflow management (pipeline generation, maintenance and monitoring) even more challenging.

This is a three-part series. The first part covered an overview of Airflow along with a few architectural details, and the second part covered the deployment options for Airflow in production. This part (the last of the series) covers the installation steps, with commands, for Airflow and its dependencies.

Part 3: Installation steps for Airflow and its dependencies

Install pip

Installation steps
sudo yum install epel-release
sudo yum install python-pip python-wheel 
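
Before moving on, it is worth a quick check that pip is on the path (and, optionally, upgrading it):

# verify the pip installation
pip --version
# optionally upgrade pip itself
sudo pip install --upgrade pip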

Install Erlang

Installation steps
sudo yum install wxGTK
sudo yum install erlang
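
A quick way to confirm the Erlang installation is to print the emulator version:

# prints the Erlang/BEAM emulator version and exits
erl -version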

Install RabbitMQ

Installation steps
wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.2/rabbitmq-server-3.6.2-1.noarch.rpm
sudo yum install rabbitmq-server-3.6.2-1.noarch.rpm
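
The RPM installs RabbitMQ as a service, but the broker still has to be started. A minimal sketch for a systemd-based CentOS/RHEL 7 host is shown below (on older init systems, use the service command instead):

# start the RabbitMQ broker and enable it at boot
sudo systemctl start rabbitmq-server
sudo systemctl enable rabbitmq-server
# confirm the broker is up
sudo rabbitmqctl status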

Install Celery

Installation steps
pip install celery
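
Optionally, you can create a dedicated RabbitMQ user and vhost for Celery instead of relying on the default guest account (which only works from localhost). The user, password and vhost names below are placeholders:

# create a dedicated user and vhost for the Celery broker (placeholder names)
sudo rabbitmqctl add_user airflow_user airflow_pass
sudo rabbitmqctl add_vhost airflow_vhost
sudo rabbitmqctl set_permissions -p airflow_vhost airflow_user ".*" ".*" ".*"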

Airflow: Pre-requisites

Installation steps
sudo yum install gcc-gfortran libgfortran numpy redhat-rpm-config python-devel gcc-c++

Install Airflow

Installation steps
# create a home directory for airflow
mkdir ~/airflow
# export the location to the AIRFLOW_HOME variable
export AIRFLOW_HOME=~/airflow
# install airflow
pip install airflow
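
To keep AIRFLOW_HOME set in future shell sessions, and to pull in the Celery integration, you may also want something along these lines (the airflow[celery] extra is an assumption about the package's optional dependencies and its name can vary between versions):

# persist AIRFLOW_HOME for future shell sessions
echo 'export AIRFLOW_HOME=~/airflow' >> ~/.bashrc
# optionally install the Celery extra for Airflow (extra names may vary by version)
pip install "airflow[celery]"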

Initialize the Airflow database

Installation steps
airflow initdb

By default, Airflow installs with a SQLite database. The above step also creates an airflow.cfg file within the $AIRFLOW_HOME directory. Once this is done, you may want to change the repository database to a well-known (highly available) relational database such as MySQL or Postgres, and then reinitialize the database (using the airflow initdb command). That will create all the required tables for Airflow in the relational database.
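
For example, switching the metadata database to MySQL is typically just a matter of editing the sql_alchemy_conn setting in the [core] section of airflow.cfg and re-running the initialization; the host, database name and credentials below are placeholders:

# in $AIRFLOW_HOME/airflow.cfg, under [core] (placeholder credentials)
sql_alchemy_conn = mysql://airflow_user:airflow_pass@localhost:3306/airflow

# then re-create the metadata tables in the new database
airflow initdb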

Start the Airflow components

Installation steps
# Start the Scheduler
airflow scheduler
# Start the Webserver
airflow webserver
# Start the Worker
airflow worker
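
Note that the worker is only relevant when the executor is set to CeleryExecutor; with the default SequentialExecutor the scheduler runs the tasks itself. Once the webserver is up, the UI is served on port 8080 by default. A quick sanity check from the command line (CLI names as in Airflow 1.x) could look like:

# start the webserver on an explicit port (8080 is the default)
airflow webserver -p 8080
# list the DAGs that Airflow has picked up
airflow list_dags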

Below are a few important configuration points in the airflow.cfg file.
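
As an illustration, and assuming a RabbitMQ/Celery setup like the one above, these are some of the settings typically adjusted (section and key names as in Airflow 1.x; connection strings are placeholders):

# [core] section
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow_user:airflow_pass@localhost:3306/airflow

# [webserver] section
web_server_port = 8080

# [celery] section
broker_url = amqp://airflow_user:airflow_pass@localhost:5672/airflow_vhost
celery_result_backend = db+mysql://airflow_user:airflow_pass@localhost:3306/airflow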

 

This concludes the series. We will update this part with more steps as and when possible.