Airflow

Making Apache Airflow Highly Available

In a previous post, we discussed Setting up an Apache Airflow Cluster. In this post we'll talk about the shortcomings of a typical Apache Airflow Cluster and what can be done to provide a Highly Available Airflow Cluster.

A Typical Apache Airflow Cluster

In a typical multi-node Airflow cluster you can separate out all the major processes onto separate machines. Here are the main processes:

Web Server
A daemon which accepts HTTP requests and allows you to interact with Airflow via a Python Flask web application. It provides the ability to pause and unpause DAGs, manually trigger DAGs, view running DAGs, restart failed DAGs, and much more.

Scheduler
A daemon which periodically polls to determine whether any registered DAGs and/or Task Instances need to be triggered based on their schedules.

Executors/Workers
A daemon that handles starting up and managing one to many CeleryD processes to execute the tasks of a particular DAG.

High Availability in a…
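For context, here is a minimal sketch of the kind of DAG these daemons cooperate on: the Scheduler polls its schedule, a Celery worker runs the task, and the Web Server displays it. The dag_id, schedule, and command are hypothetical, and the imports assume the Airflow 1.x module layout of that era:

    # Illustrative only: a trivial DAG picked up by the Scheduler and
    # executed by a Celery worker in a multi-node cluster.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="example_cluster_dag",      # visible in the Web Server UI
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",        # polled by the Scheduler
    )

    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'run by a Celery worker'",
        dag=dag,
    )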

Analytics

Work Flow Management for Big Data: Guide to Airflow (part 1)

Data analytics has been playing a key role in decision making at various stages of the business across many industries. In this era of Big Data, that adoption will only increase day by day. It is overwhelming to see all the Big Data technologies popping up every week to cater to the various stages of a Big Data solution. With data being generated at a fast pace from various sources (the applications that automate business processes), implementing solutions for use cases like "real-time data ingestion from various sources", "processing the data at different levels of the ingestion" and "preparing the final data for analysis" becomes challenging. In particular, orchestrating, scheduling, managing, and monitoring the pipelines is a critical task if the data platform is to be stable and reliable. Also, due to the dynamic nature of the data sources, data inflow rate,…
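As a rough sketch of the ingest/process/prepare pipeline described above, the following shows how such stages could be orchestrated as an Airflow DAG; the stage names and callables are hypothetical placeholders, and the imports again assume the Airflow 1.x layout:

    # Illustrative three-stage pipeline: ingest -> process -> prepare.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def ingest():
        """Pull raw data from a source system (placeholder)."""
        pass

    def process():
        """Transform the ingested data (placeholder)."""
        pass

    def prepare():
        """Publish the final, analysis-ready data (placeholder)."""
        pass

    dag = DAG(
        dag_id="big_data_pipeline",
        start_date=datetime(2017, 1, 1),
        schedule_interval="@hourly",
    )

    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest, dag=dag)
    process_task = PythonOperator(task_id="process", python_callable=process, dag=dag)
    prepare_task = PythonOperator(task_id="prepare", python_callable=prepare, dag=dag)

    # Orchestration: run the stages strictly in order.
    ingest_task >> process_task >> prepare_task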