Data analytics plays a key role in decision making at various stages of business across many industries, and in this era of Big Data the level of adoption is only going to increase. It is overwhelming to see the number of Big Data technologies that pop up every week to cater to the various stages of a Big Data solution implementation. With data being generated at a very fast pace from many sources (the applications that automate business processes), implementing solutions for use cases such as "real-time data ingestion from various sources", "processing the data at different stages of ingestion" and "preparing the final data for analysis" becomes challenging. In particular, orchestrating, scheduling, managing and monitoring the pipelines is critical for the data platform to be stable and reliable. Moreover, the dynamic nature of the data sources, data inflow rates, data schemas and processing needs makes workflow management (pipeline generation, maintenance and monitoring) even more challenging.
This is a three-part series. The first part covered an overview of Airflow along with a few architectural details, and the second part covered the deployment options for Airflow in production. This part (the last of the series) covers the installation steps, with commands, for Airflow and its dependencies.
Part 3: Installation steps for Airflow and its dependencies
Install pip
sudo yum install epel-release
sudo yum install python-pip python-wheel
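As an optional sanity check (not part of the original steps), you can confirm that pip is on the path before moving on:

# verify the pip installation
pip --version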
Install Erlang
sudo yum install wxGTK
sudo yum install erlang
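A quick, optional check that the Erlang runtime is installed; this prints the emulator version:

# print the installed Erlang emulator version
erl -version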
RabbitMQ
wget https://www.rabbitmq.com/releases/rabbitmq-server/v3.6.2/rabbitmq-server-3.6.2-1.noarch.rpm
sudo yum install rabbitmq-server-3.6.2-1.noarch.rpm
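After installing the RPM, the RabbitMQ broker still needs to be started. On a systemd-based CentOS/RHEL 7 host, something along these lines should work (the exact service commands may differ on older init systems):

# start the RabbitMQ broker and enable it on boot
sudo systemctl start rabbitmq-server
sudo systemctl enable rabbitmq-server
# check that the broker is up
sudo rabbitmqctl status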
Celery
pip install celery
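Optionally, confirm the Celery installation:

# print the installed Celery version
celery --version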
Airflow: Pre-requisites
sudo yum install gcc-gfortran libgfortran numpy redhat-rpm-config python-devel gcc-c++
Airflow
# create a home directory for airflow
mkdir ~/airflow
# export the location to AIRFLOW_HOME variable
export AIRFLOW_HOME=~/airflow
pip install airflow
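Note that the export above only lasts for the current shell session. To make AIRFLOW_HOME persistent, you may want to add it to your shell profile and then confirm the installation (the use of ~/.bashrc here is just one common choice):

# persist AIRFLOW_HOME across shell sessions
echo 'export AIRFLOW_HOME=~/airflow' >> ~/.bashrc
# confirm the installation
airflow version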
Initialize the Airflow database
airflow initdb
By default, Airflow uses a SQLite database. The step above creates an airflow.cfg file inside the $AIRFLOW_HOME/ directory. Once this is done, you may want to change the repository database to a well-known (highly available) relational database such as MySQL or PostgreSQL, and then reinitialize the database (using the airflow initdb command). That will create all the tables Airflow requires in the relational database.
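As a sketch of what switching the metadata database to MySQL could look like: the database name, user and password below are just placeholders matching the configuration values shown further down, and the mysql extra is the one documented for Airflow 1.x.

# create a database for Airflow in MySQL (credentials are placeholders)
mysql -u root -p -e "CREATE DATABASE airflow CHARACTER SET utf8;"
# install the MySQL bindings for Airflow
pip install airflow[mysql]
# point sql_alchemy_conn in $AIRFLOW_HOME/airflow.cfg to the new database, e.g.
#   sql_alchemy_conn = mysql://root:root@localhost/airflow
# then recreate the metadata tables
airflow initdb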
Start the Airflow components
# Start the Scheduler
airflow scheduler
# Start the Webserver
airflow webserver
# Start the Worker
airflow worker
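For a quick test setup where all three components run on the same machine, one simple option is to push each one to the background; the log file names below are arbitrary choices, not Airflow defaults:

# run each component in the background and capture its output
nohup airflow scheduler > ~/airflow/scheduler.log 2>&1 &
nohup airflow webserver > ~/airflow/webserver.log 2>&1 &
nohup airflow worker > ~/airflow/worker.log 2>&1 &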
Below are a few important configuration points in the airflow.cfg file (a consolidated sample snippet follows this list):
- dags_folder = /root/airflow/dags
  - The folder where your Airflow pipelines (DAGs) live
- executor = LocalExecutor
  - The executor class that Airflow should use
- sql_alchemy_conn = mysql://root:root@localhost/airflow
  - The SqlAlchemy connection string to the metadata database
- base_url = http://localhost:8080
  - The hostname and port at which the Airflow webserver runs
- broker_url = sqla+mysql://root:root@localhost:3306/airflow
  - The Celery broker URL; Celery supports RabbitMQ, Redis and, experimentally, a SQLAlchemy database
- celery_result_backend = db+mysql://root:root@localhost:3306/airflow
  - A key Celery setting that determines where the workers write task results
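Putting these together, the relevant portion of a sample airflow.cfg might look roughly like the following. The section names follow the Airflow 1.x layout, and the root:root credentials are just the placeholder values from the list above; the Celery broker and result-backend settings only take effect if you set the executor to CeleryExecutor.

[core]
dags_folder = /root/airflow/dags
executor = LocalExecutor
sql_alchemy_conn = mysql://root:root@localhost/airflow

[webserver]
base_url = http://localhost:8080

[celery]
broker_url = sqla+mysql://root:root@localhost:3306/airflow
celery_result_backend = db+mysql://root:root@localhost:3306/airflow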
Here we conclude the series. We will update this part with more steps as and when possible.