Making Apache Airflow Highly Available

In a previous post, we discussed Setting up an Apache Airflow Cluster. In this post we’ll talk about the shortcomings of a typical Apache Airflow Cluster and what can be done to provide a Highly Available Airflow Cluster.

A Typical Apache Airflow Cluster

In a typical multi-node Airflow cluster you can separate out all the major processes onto separate machines. Here are the main processes:

Web Server

A daemon which accepts HTTP requests and allows you to interact with Airflow via a Python Flask Web Application. It provides the ability to pause and unpause DAGs, manually trigger DAGs, view running DAGs, restart failed DAGs and much more.


Scheduler

A daemon which periodically polls to determine if any registered DAGs and/or Task Instances need to be triggered based on their schedule.


Worker

A daemon that handles starting up and managing one or more CeleryD processes to execute the desired tasks of a particular DAG.

High Availability in a Typical Apache Airflow Cluster

A typical cluster can provide a good amount of High Availability right off the bat. It does this by allowing for redundancy in most of the core processes listed above:

Web Server

You can run web servers on multiple Master Nodes, all behind a Load Balancer. This means that if one of the Masters goes down, you have at least one other Master available to accept HTTP requests forwarded from the Load Balancer.


Worker

You can set up multiple Worker nodes. If one of those nodes were to go down, the others would still be active and able to accept and execute tasks.



Problems with the Typical Apache Airflow Cluster

The problem with the traditional Airflow Cluster setup is that there can’t be any redundancy in the Scheduler daemon. If you were to have multiple Scheduler instances running, multiple instances of a single task could be scheduled for execution. This can be a very bad thing depending on your jobs. For example, if you were running a workflow that performs some type of ETL process, you might end up with duplicate data extracted from the original source, incorrect results from duplicate transformation processes, or duplicate data in the final destination where the data is loaded. So, when setting up an Airflow cluster, you can only have a single Scheduler daemon running on the entire cluster. If this single Airflow Scheduler instance were to crash, no DAGs or tasks would be scheduled on your Airflow cluster.

The Solution

There isn’t a way in a plain distribution of Airflow to enable High Availability for the Scheduler. Instead, at Clairvoyant we created a process that allows for a Highly Available Scheduler instance; we call it the Airflow Scheduler Failover Controller.

This process tries to ensure that there is always one and only one Scheduler instance running at a time. If one Scheduler instance dies, the failover controller tries to start it back up again. If it still doesn’t start up on the original machine, the controller tries to start it on another, trying to ensure that there’s at least one running in the cluster.

In addition, to prevent this process from becoming the one process that prevents the entire cluster from being highly available (because if this process dies then the Scheduler will no longer be Highly Available), we also allow redundancy in the Scheduler Failover Controller. One Scheduler Failover Controller is selected as the ACTIVE instance, and all others stay in a STANDBY state until the ACTIVE Failover Controller stops reporting in. It’s recommended that you run the Scheduler Failover Controller on the same machines you designate for running the Schedulers.



How the Scheduler Failover Controller Works

There will ideally be multiple Scheduler Failover Controllers running: one that starts in an ACTIVE state, and at least one other that starts in a STANDBY state.

The ACTIVE Scheduler Failover Controller will regularly push a HEART BEAT into a metastore (supported metastores: MySQL DB, Zookeeper), which the STANDBY Scheduler Failover Controller reads to see if it needs to become ACTIVE (if the last heart beat is too old, the STANDBY Scheduler Failover Controller knows the ACTIVE instance is not running).
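
The heart beat check described above can be sketched roughly as follows. This is only an illustration, not the Failover Controller’s actual code: a plain dictionary stands in for the metastore (MySQL or Zookeeper), and the key name, function names and the 30 second threshold are all assumptions made up for the example.

```python
import time

# Hypothetical key under which the ACTIVE controller stores its last heart beat.
# A plain dict stands in for the real metastore (MySQL or Zookeeper).
HEARTBEAT_KEY = "active_failover_heartbeat"


def push_heartbeat(metastore, now=None):
    """Called by the ACTIVE controller: record the time of its latest heart beat."""
    metastore[HEARTBEAT_KEY] = time.time() if now is None else now


def standby_should_take_over(metastore, max_age_seconds, now=None):
    """Called by a STANDBY controller: True if the heart beat is missing or too old."""
    now = time.time() if now is None else now
    last_beat = metastore.get(HEARTBEAT_KEY)
    if last_beat is None:
        # No ACTIVE controller has ever reported in.
        return True
    return (now - last_beat) > max_age_seconds


if __name__ == "__main__":
    store = {}
    push_heartbeat(store, now=100.0)
    # Heart beat is 10 seconds old: the STANDBY stays in STANDBY.
    print(standby_should_take_over(store, max_age_seconds=30, now=110.0))
    # Heart beat is 100 seconds old: the STANDBY should become ACTIVE.
    print(standby_should_take_over(store, max_age_seconds=30, now=200.0))
```

Passing the current time in as an argument keeps the check easy to test; a real controller would read the clock itself and would need the metastore to arbitrate when several STANDBY instances notice the stale heart beat at once.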

The ACTIVE Scheduler Failover Controller will poll every X seconds (the default is 10 seconds, but this can be configured) to see if the Airflow Scheduler is running on the desired node. If it is not, the Scheduler Failover Controller will try to restart the daemon. If the Scheduler daemon still doesn’t start up, the Scheduler Failover Controller will attempt to start the Scheduler daemon on another master node in the cluster. As a part of this poll, the ACTIVE Scheduler Failover Controller will also check that Scheduler daemons aren’t running on the other nodes.
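
A single pass of that poll might look like the sketch below. It is a simplified model under stated assumptions rather than the real implementation: `poll_once`, `FakeCluster` and the node names are hypothetical, and the checks that would normally run commands on remote hosts are replaced by an in-memory stand-in.

```python
def poll_once(nodes, active_node, is_running, start, stop):
    """Run one polling pass; return the node the Scheduler runs on afterwards.

    `is_running`, `start` and `stop` are callables taking a node name. In a
    real deployment they would check and control the Scheduler daemon on
    that host; here they are injected so the logic can be tried in memory.
    """
    # Make sure no Scheduler daemon is running on any other node.
    for node in nodes:
        if node != active_node and is_running(node):
            stop(node)

    if is_running(active_node):
        return active_node

    # Try to restart on the desired node first, then fail over to the others.
    candidates = [active_node] + [n for n in nodes if n != active_node]
    for candidate in candidates:
        start(candidate)
        if is_running(candidate):
            return candidate
    return None  # the Scheduler could not be started anywhere


class FakeCluster:
    """In-memory stand-in for real hosts; nodes listed in `broken` never start."""

    def __init__(self, broken=()):
        self.running = set()
        self.broken = set(broken)

    def is_running(self, node):
        return node in self.running

    def start(self, node):
        if node not in self.broken:
            self.running.add(node)

    def stop(self, node):
        self.running.discard(node)


if __name__ == "__main__":
    cluster = FakeCluster(broken={"master1"})
    new_active = poll_once(
        ["master1", "master2"], "master1",
        cluster.is_running, cluster.start, cluster.stop,
    )
    print(new_active)
```

The ACTIVE controller would call something like `poll_once` in a loop, sleeping for the configured interval between passes and using the returned node as the desired node for the next pass.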

Setup Steps

  1. Set up Airflow on all the nodes you want to act in the cluster
  2. Configure Airflow to use CeleryExecutor
  3. Configure each Airflow instance to point to the same external MySQL instance and DB for the sql_alchemy_conn and celery_result_backend properties
    • It’s also recommended to follow steps to make MySQL, or whatever type of database you’re using, Highly Available too.
  4. If you’re using RabbitMQ as your Queueing Service, set it up to be Highly Available
    1. Set up a RabbitMQ Cluster with HA
    2. Set up a Load Balancer for RabbitMQ
  5. Configure each Airflow instance to point to the same Queueing Service (set the broker_url property)
  6. Deploy the Airflow Scheduler Failover Controller to all the nodes acting as Failover Controllers (the same nodes that act as Schedulers)
  7. Configure the Airflow Scheduler Failover Controller
  8. Set up a Load Balancer to balance requests between the Nodes for the Web Server
    1. Port Forwarding
      1. Port 8080 (HTTP) → Port 8080 (HTTP)
    2. Health Check
      1. Protocol: HTTP
      2. Ping Port: 8080
      3. Ping Path: /
  9. Start up the Airflow services
    1. WebServer and Failover Controller instances to be started on the Master Nodes
    2. Worker instances to be started on the Worker Nodes
  10. Deploy your DAG to the DAG directory of every Airflow instance that’s acting as a Master Node
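
Steps 2, 3 and 5 only work if every node really does share the same settings, so a quick consistency check can be handy. The sketch below parses each node’s airflow.cfg content and reports differences. The host names and credentials are placeholders, and the section names ([core] and [celery]) are an assumption that depends on your Airflow version, so adjust them to match your own airflow.cfg.

```python
import configparser

# (section, option) pairs that must be identical on every node.
# The section names are an assumption and vary between Airflow versions.
SHARED_OPTIONS = [
    ("core", "sql_alchemy_conn"),
    ("celery", "broker_url"),
    ("celery", "celery_result_backend"),
]

# Placeholder configuration; hosts and credentials are made up for illustration.
EXAMPLE_CFG = """
[core]
executor = CeleryExecutor
sql_alchemy_conn = mysql://airflow:secret@mysql.example.internal:3306/airflow

[celery]
broker_url = amqp://airflow:secret@rabbitmq-lb.example.internal:5672/
celery_result_backend = db+mysql://airflow:secret@mysql.example.internal:3306/airflow
"""


def load_cfg(text):
    """Parse airflow.cfg content; interpolation is disabled so '%' is kept literally."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def find_config_problems(cfg_by_node):
    """Return a list of human-readable problems; an empty list means the nodes agree."""
    problems = []
    for node, cfg in cfg_by_node.items():
        if cfg.get("core", "executor", fallback=None) != "CeleryExecutor":
            problems.append(f"{node}: executor is not CeleryExecutor")
    for section, option in SHARED_OPTIONS:
        values = {cfg.get(section, option, fallback=None) for cfg in cfg_by_node.values()}
        if len(values) > 1 or None in values:
            problems.append(f"{section}.{option} is missing or differs between nodes")
    return problems


if __name__ == "__main__":
    configs = {"master1": load_cfg(EXAMPLE_CFG), "worker1": load_cfg(EXAMPLE_CFG)}
    print(find_config_problems(configs))
```

In practice you would read each node’s actual airflow.cfg (for example after copying the files to one machine) instead of using the inline example string.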


Imitation of Intelligence : Exploring Artificial Intelligence!

What is the difference between “calculate” and “compute”?

Light & flexible

I assure you, we are not going to dwell on such quintessential computing terms, which might bore some of us, even if the question gave that impression 😀

But the question comes from curiosity about the crux of what we are going to go through.



So, calculation involves an arithmetic process. Computation covers the non-arithmetic steps of an algorithm, the steps that actually lead up to the calculation.

You get where I am going with this, right? We can visualize every stage of data processing, from collection and cleansing to processing and transformation through mathematical operations, as mapping data into something that makes more sense, i.e. “Insight”. The intelligence behind such meaningful transformation used to be human intervention; under the new digital trend it can now be “Artificial”.

Getting to know …

Artificial Intelligence in the industry will change everything about the way we produce, manufacture and deliver. Cognitive computing, machine learning, natural language processing: different aspects have emerged as the technology has developed in recent years. But they all encapsulate the idea that machines could one day be taught to learn how to adapt by themselves, rather than having to be spoon-fed every instruction for every eventuality. There are certain important emerging digital trends we can track, as technology and the future are converging very fast. Years ago the industrial revolution irreversibly reshaped society, and another revolution is underway with potentially even further-reaching consequences. These digital trends are all potentially disruptive unless we plan ahead for the impact and change that is coming. The likely benefits are more agility, smarter business processes, and better productivity from focusing effort on the right things.

Goals of Artificial Intelligence

Artificial intelligence (AI) has become ubiquitous in business, in every industry where decision making is being fundamentally transformed by machine brains. The need for faster and smarter decisions, and for managing the big data that can make the difference, is what is driving this trend. The convergence of big data with AI is inevitable, as the automation of smarter decision-making is the next evolution of big data. While adapting to this change, some will inevitably prosper and some will fail. Those that succeed are likely to be those that can see beyond the hype and understand how this technology can add real value and drive positive change.

The best way to look at AI is as automating things that have already been worked out and implemented logically to solve a problem. This applies existing problem-solving logic effectively and smartly through artificial brains, leaving human brains and effort to focus on problems that still need more attention. A world where machines and devices all communicate with each other to get the work done (IoT), leaving us free to relax and enjoy life, can be imagined through effective use of AI.

Let’s wander around this digital trend and explore how it can be integrated for a better life 🙂

#bigdata is becoming the norm for many organizations, which use it to profile people, analyze their behavioral patterns and inform their decision-making processes, whether that’s to decide a basic day-to-day process or a strategic business decision.

#ArtificialIntelligence or #AI is stepping out of the world of science-fiction and into real life, providing the ‘thinking’ ability behind virtual personal assistants, automated robots, and smart cars.

#MachineLearning algorithms are providing intelligence for discovering patterns in the huge amount of data that traditional data analysis couldn’t hope to find, helping to detect fraud and diagnose diseases.

I tried interacting with Google Home, an AI application developed by Google, and asked it some questions about this trend. The conversation follows.

Me: What is Data Analytics?
Google Home: Data Analytics refers to qualitative and quantitative techniques and processes used to enhance productivity and business gain. Data is extracted and categorized to identify and analyze behavioral data and patterns, and techniques vary according to organizational requirements.

“Ok Google”

Me: What is Machine Learning?

Google Home: Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can change when exposed to new data. … Both systems search through data to look for patterns.

Me: What is Artificial Intelligence?
Google Home: the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.

Well, that’s cool.

Artificial Intelligence is often classified into one of two fundamental groups: applied and general.
Applied AI is more common – systems designed to intelligently trade stocks and shares, or maneuver an autonomous vehicle would fall into this category.
Generalized AI – systems or devices which can, in theory, handle any task – are less common, but this is where some of the most exciting advancements are happening today. It is also the area that has led to the development of Machine Learning. Often referred to as a subset of AI, it’s really more accurate to think of it as the current state-of-the-art.

The relation between Artificial Intelligence and Machine Learning:

Artificial Intelligence, human intelligence exhibited by machines, is the broader concept of machines being able to perform tasks in a way that imitates human intelligence, i.e. artificially.
Machine Learning, one of many approaches to achieving Artificial Intelligence, is an application of AI built around the idea of letting machines learn for themselves given access to information.

Deep Learning has enabled many practical applications of Machine Learning and in turn the overall field of AI. It breaks down tasks in ways that make all kinds of machine assistance seem possible, even likely.

Concept evolution!

As technology and our understanding of how human minds work have progressed, our concept of what constitutes AI has changed. Rather than increasingly complex calculations, work in the field of AI has concentrated on imitating human decision-making processes and carrying out tasks in ever more human ways. With these innovations in place, engineers realized that rather than teaching computers and machines everything, it would be far more efficient to code them to think and learn like a human brain and to provide the internet as a learning platform, giving them access to all of the information in the world.

In making computers think and understand the world the way we do, while retaining the innate advantages they hold over us such as speed, accuracy, and lack of bias, the development of neural networks played the key role.

The next advancement goes a step further: sparing developers the complexity of learning AI concepts and the algorithmic journey of ML by providing a platform on which to build an AI application with simple logic, freeing them to focus on the AI problem they want to solve.

I am happy to see that some industry leaders are taking an interest in this and making complex technologies such as AI and ML available as simple platforms for creating voice/text assistants, addressing this perspective of data science.

Amazon Alexa

And many more in the market. Such initiatives will always be appreciated.

About Google API.AI – Understand Google and build AI Assistant

Looking at the other side of this …

There are concerns that this technology will lead to widespread unemployment. That is beyond the scope of this discussion, but it does touch on a point we should consider. Employees are often a business’s biggest expense, but does that mean it’s sensible to think of AI as primarily a means of cutting HR costs?

I don’t think so.

Think about it!

The fully autonomous, AI-powered, human-free industrial operation still seems far from becoming reality, and human employees working alongside AI machines is likely to be the way of things. How can an intelligence developed by humans REPLACE a human? Surely it can replace the repetitive, mechanizable efforts of a human in places where artificial intelligence can work. So if you’re looking to generate value in the near future, then thinking about ways to empower humans with technology, rather than replace them, is likely to be more productive. In doing these things we can free people to put all of their creativity, passion, and imagination into thinking about the bigger opportunities ahead of us.

Trends are only disruptive if we are unprepared to factor them into our strategy. How trends impact our workforce, customers, market, services and in turn our lives should be carefully pondered. And perhaps most importantly, a business needs a clear use case and a genuine understanding of how, and why, it can gain value from them. With anything new and exuberant in business, there’s often a race to be involved, driven primarily by a fear of being left behind. Scrambling to automate and smarten an enterprise without a clear outlook of what you hope to achieve is a misdirection of intelligence.

As said by Mark Zuckerberg, “A frustration I have is that a lot of people increasingly seem to equate an advertising business model with somehow being out of alignment with your customers, … I think it’s the most ridiculous concept. What, you think because you’re paying Apple that you’re somehow in alignment with them? If you were in alignment with them, then they’d make their products a lot cheaper!”

Another frustration we should feel is that we increasingly spread our efforts across various technology trends without aligning them with their use and impact on our lives; I think that is an even more ridiculous concept. To be productive, efforts need to be meticulous and pointed in the proper direction, and AI can help find this direction quickly and easily. If we were in alignment with the constructive use and right influence of technology trends, they would make our lives easier and happier!

Let’s embrace the change and explore integrity!

Image credits: Google

Recommended watching:

– Difference Between Artificial Intelligence, Machine Learning, and Deep Learning?
– Difference between Artificial Intelligence and Machine Learning