Setting up an Apache Airflow Cluster

In one of our previous blog posts, we described the process you should take when Installing and Configuring Apache Airflow.  In this post, we will describe how to setup an Apache Airflow Cluster to run across multiple nodes. This will provide you with more computing power and higher availability for your Apache Airflow instance.

Airflow Daemons

A running instance of Airflow has a number of Daemons that work together to provide the full functionality of Airflow. The daemons include the Web Server, Scheduler, Worker, Kerberos Ticket Renewer, Flower and others. Bellow are the primary ones you will need to have running for a production quality Apache Airflow Cluster.

Web Server

A daemon which accepts HTTP requests and allows you to interact with Airflow via a Python Flask Web Application. It provides the ability to pause, unpause DAGs, manually trigger DAGs, view running DAGs, restart failed DAGs and much more.

The Web Server Daemon starts up gunicorn workers to handle requests in parallel. You can scale up the number of gunicorn workers on a single machine to handle more load by updating the ‘workers’ configuration in the {AIRFLOW_HOME}/airflow.cfg file.


workers = 4

Startup Command:

$ airflow webserver

A daemon which periodically polls to determine if any registered DAG and/or Task Instances needs to triggered based off its schedule.

Startup Command:

$ airflow scheduler

A daemon that handles starting up and managing 1 to many CeleryD processes to execute the desired tasks of a particular DAG.

This daemon only needs to be running when you set the ‘executor ‘ config in the {AIRFLOW_HOME}/airflow.cfg file to ‘CeleryExecutor’. It is recommended to do so for Production.


executor = CeleryExecutor

Startup Command:

$ airflow worker

How do the Daemons work together?

One thing to note about the Airflow Daemons is that they don’t register with each other or even need to know about each other. Each of them handle their own assigned task and when all of them are running, everything works as you would expect.

  1. The Scheduler periodically polls to see if any DAGs that are registered in the MetaStore need to be executed. If a particular DAG needs to be triggered (based off the DAGs Schedule), then the Scheduler Daemon creates a DagRun instance in the MetaStore and starts to trigger the individual tasks in the DAG. The scheduler will do this by pushing messages into the Queueing Service. Each message contains information about the Task it is executing including the DAG Id, Task Id and what function needs to be performed. In the case where the Task is a BashOperator with some bash code, the message will contain this bash code.
  2. A user might also interact with the Web Server and manually trigger DAGs to be ran. When a user does this, a DagRun will be created and the scheduler will start to trigger individual Tasks in the DAG in the same way that was mentioned in #1.
  3. The celeryd processes controlled by the Worker daemon, will pull from the Queueing Service on regular intervals to see if there are any tasks that need to be executed. When one of the celeryd processes pulls a Task message, it updates the Task instance in the MetaStore to a Running state and tries to execute the code provided. If it succeeds then it updates the state as succeeded but if the code fails while being executed then it updates the Task as failed.

Single Node Airflow Setup

A simple instance of Apache Airflow involves putting all the services on a single node like the bellow diagram depicts.

Apache Airflow Single-Node Cluster

Multi-Node (Cluster) Airflow Setup

A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.

Apache Airflow Multi-Node Cluster


Higher Availability

If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.

Distributed Processing

If you have a workflow with several memory intensive tasks, then the tasks will be better distributed to allow for higher utilizaiton of data across the cluster and provide faster execution of the tasks.

Scaling Workers


You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.


You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.


celeryd_concurrency = 30

You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.

Scaling Master Nodes

You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon incase there are too many HTTP requests coming for one machine to handle or if you want to provide Higher Availability for that service.

One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.

If you would like, the Scheduler daemon may also be setup to run on its own dedicated Master Node.

Apache Airflow Multi-Master Node Cluster

Apache Airflow Cluster Setup Steps

  • The following nodes are available with the given host names:
    • master1
      • Will have the role(s): Web Server, Scheduler
    • master2
      • Will have the role(s): Web Server
    • worker1
      • Will have the role(s): Worker
    • worker2
      • Will have the role(s): Worker
  • A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
    • You can install RabbitMQ by following these instructions: Installing RabbitMQ
      • If you’re using RabbitMQ, it is recommended that it is also setup to be a cluster for High Availability. Setup a Load Balancer to proxy requests to the RabbitMQ instances.
  1. Install Apache Airflow on ALL machines that will have a role in the Airflow
  2. Apply Airflow Configuration changes to all ALL machines. Apply changes to the {AIRFLOW_HOME}/airflow.cfg file.
    1. Change the Executor to CeleryExecutor
      executor = CeleryExecutor
    2. Point SQL Alchemy to the MetaStore
      sql_alchemy_conn = mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
    3. Set the Broker URL
      1. If you’re using RabbitMQ:
        broker_url = amqp://guest:guest@{RABBITMQ_HOST}:5672/
      2. If you’re using AWS SQS:
        broker_url = sqs://{ACCESS_KEY_ID}:{SECRET_KEY}@
        #Note: You will also need to install boto:
        $ pip install -U boto
    4. Point Celery to the MetaStore
      celery_result_backend = db+mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
  3. Deploy your DAGs/Workflows on master1 and master2 (and any future master nodes you might add)
  4. On master1, initialize the Airflow Database (if not already done after updating the sql_alchemy_conn configuration)
    airflow initdb
  5. On master1, startup the required role(s)
    • Startup Web Server
      $ airflow webserver
    • Startup Scheduler
      $ airflow scheduler
  6. On master2, startup the required role(s)
    • Startup Web Server
      $ airflow webserver
  7. On worker1 and worker2, startup the required role(s)
    • Startup Worker
      $ airflow worker
  8. Create a Load Balancer to balance requests going to the Web Servers
    • If you’re in AWS you can do this with the EC2 Load Balancer
      • Sample Configurations:
        • Port Forwarding
          • Port 8080 (HTTP) → Port 8080 (HTTP)
        • Health Check
          • Protocol: HTTP
          • Ping Port: 8080
          • Ping Path: /
          • Success Code: 200,302
    • If you’re not on AWS you can use something like haproxy to proxy/balance requests to the Web Servers
      • Sample Configurations:
         log local2
         chroot /var/lib/haproxy
         pidfile /var/run/
         maxconn 4000
         user haproxy
         group haproxy
         # turn on stats unix socket
         # stats socket /var/lib/haproxy/stats
         mode tcp
         log global
         option tcplog
         option tcpka
         retries 3
         timeout connect 5s
         timeout client 1h
         timeout server 1h
        # port forwarding from 8080 to the airflow webserver on 8080
        listen impala
         balance roundrobin
         server airflow_webserver_1 host1:8080 check
         server airflow_webserver_2 host2:8080 check
        # This sets up the admin page for HA Proxy at port 1936.
        listen stats :1936
         mode http
         stats enable
         stats uri /
         stats hide-version
         stats refresh 30s
  9. You’re done!

Additional Documentation


Install Documentation:

GitHub Repo:


  1. Tim

    Did you actually set this up? I have have a more simplified version and cannot seem to connect to the webserver while using an ELB.

    1. Robert Sanders Post author

      Yes we have set this up in AWS to use the ELB.

      I assume you checked that the Web Servers are running on the Master nodes right and that the load balancer is able to communicate with them?

      Do you see healthy instances in the Target Group (Assuming you’re using the Application Load Balancer as opposed to the Classic Load Balancer)?

      One tricky thing it might be is the Health Check configurations. Here’s what we’ve put for this check:

      • Protocol: HTTP
      • Ping Port: 8080
      • Ping Path: /
      • Success Code: 200,302

      Notice the success code can be 302. This is because if you hit the path / it redirects to /admin/

      Can you provide more details on what you’re seeing?

  2. kush patel

    I have followed instructions, everything is seems to be working. However, I am getting following error when I run airflow worker on slave node.

    consumer: Cannot connect to amqp://user:**@ip-11-222-11-117:5672//: Couldn’t log in: a socket error occurred.
    Trying again in 2.00 seconds…

    1. Robert Sanders Post author

      This indicates that you’re not able to connect to rabbitmq from the machine you’re running airflow. It appears that it was able to resolve the host (otherwise you would have gotten an error like ‘Cannot connect to amqp://guest:**@HOST:5672//: [Errno -2] Name or service not known.’). So that would tell me that it can’t connect to the port.

      If you’re in AWS then you would need to update the security groups of the machine you have rabbitmq running on to allow access to the 5672 port.

      You can test if this is the problem by executing the command:
      $ telnet ip-11-222-11-117 5672
      Note: you may need to install telnet

      If it connects then you know that it should work for Airflow

  3. Pingback: Making Apache Airflow Highly Available - Home

  4. Ram


    I am getting the following error.

    [2017-07-18 17:36:15,546: ERROR/MainProcess] consumer: Cannot connect to amqp://user:**@ip-10-131-129-148:5672//: [Errno 104] Connection reset by peer.
    Trying again in 2.00 seconds…

    What could be the reason?

    1. Robert Sanders Post author

      A “Connection reset by peer” failure implies that it was able to connect to a server over that port but that it got a request that it didn’t like. What I suspect may be the issue is that it failed to authenticate with RabbitMQ. I would check and make sure the credentials you’re using for connecting to RabbitMQ are correct.

  5. Ram


    Thanks for your response. I am using correct credentials but some how Rabbit MQ was not working. I tried with Redis and working successfully. My next ask is how to avoid clear text passwords in airflow.cfg. I tried by creating postgres connection in Web Admin UI and specified connection id in airflow.cfg as shown below but its not working.

    sql_alchemy_conn = postgresql+psycopg2://conn_postgres

    1. Robert Sanders Post author

      I haven’t found a way to avoid putting the plain text passwords in the airflow.cfg file.

      Regarding creating a connection in the Airflow WebUI, this approach won’t work since the connections you’re defining are stored in the Metastore DB anyways. Airflow would still need to know how to connect to the Metastore DB so that it could retrieve them.

      There are a few strategies that you can follow to secure things which we implement regularly:

      • Modify the airflow.cfg file permissions to allow only the airflow user the ability to read from that file. This will prevent others from reading the file.
      • Modify the network rules to only allow connection to the Metastore DB and RabbitMQ instance from the Airflow instances
      • Restrict the Metastore DB Permissions for the user you’re using to only allow the user to effect the airflow related tables
  6. Tea-Hyoung

    Hi, Robert,

    Thank you. I have a question.
    I made up the system According to your guidance. rabbitmq cluster, airflow cluster.

    I ran a DAG through Airflow. And I checked Task that it was piled up in the queue of RabbitMQ.
    But the Work Node is not getting task from the queue.

    The worker.logs on the worker node are as follows:
    017-08-21 11:54:44,153] {} INFO – Using executor CeleryExecutor
    [2017-08-21 11:54:44,210] {} INFO – Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
    [2017-08-21 11:54:44,227] {} INFO – Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
    Running a worker with superuser privileges when the
    worker accepts messages serialized with pickle is a very bad idea!

    If you really want to continue then you have to set the C_FORCE_ROOT
    environment variable (but please think about this before you do).

    User information: uid=0 euid=0 gid=0 egid=0

    [2017-08-21 11:54:44,726] {} INFO – Using executor CeleryExecutor
    [2017-08-21 11:54:44,783] {} INFO – Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
    [2017-08-21 11:54:44,802] {} INFO – Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
    Starting flask
    [2017-08-21 11:54:44,925] {} INFO – * Running on (Press CTRL+C to quit)

    It is difficult to find a problem.
    Please let me know if you know the cause.

    1. Tea-Hyoung


      The flower.logs on the worker node are as follows:

      [2017-08-21 11:54:44,212] {} INFO – Using executor CeleryExecutor
      [2017-08-21 11:54:44,275] {} INFO – Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
      [2017-08-21 11:54:44,297] {} INFO – Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
      [I 170821 11:54:44 command:139] Visit me at
      [I 170821 11:54:44 command:144] Broker: amqp://rabbitmq:**@x.x.x.x:5672//
      [I 170821 11:54:44 command:147] Registered tasks:
      [I 170821 11:54:44 mixins:224] Connected to amqp://rabbitmq:**@x.x.x.x:5672//
      [W 170821 11:54:46 control:44] ‘stats’ inspect method failed
      [W 170821 11:54:46 control:44] ‘active_queues’ inspect method failed
      [W 170821 11:54:46 control:44] ‘registered’ inspect method failed
      [W 170821 11:54:46 control:44] ‘scheduled’ inspect method failed
      [W 170821 11:54:46 control:44] ‘active’ inspect method failed
      [W 170821 11:54:46 control:44] ‘reserved’ inspect method failed
      [W 170821 11:54:46 control:44] ‘revoked’ inspect method failed
      [W 170821 11:54:46 control:44] ‘conf’ inspect method failed

      1. Tea-Hyoung

        I solved it. The problem is because it is acted upon with an account of root privilege. ‘airflow work’ command executed.

  7. Duleendra Shashimal

    Hi Robert,

    Thank you for the informative post. I set up a cluster and it’s working fine.
    I would like to know about the minimum hardware requirements (RAM, CPU, Disk) for airflow cluster nodes. Specially for a production set up.

  8. Chris Merry

    Great article! In fact, the entire Airflow series has been extremely helpful.

    Any idea why a worker node would try to connect to RabbitMQ on the localhost despite the BROKER_URL set to something entirely different? Other parameter values seem to be picked up correctly, such as SQL_ALCHEMY_CONN. I’m able to telnet to the AMQP node. Just doesn’t seem to want to use the BROKER_URL value.

    Thanks in advance!

  9. Pingback: apache airflow 複数worker構成のalpine版docker imageを作った | せかいらぼ

  10. Sunnysecc

    I have followed all the steps and installed airflow with 2 workers and 1 master node in different VM’s but DB and RabbitMQ is in master . How will the master know the DAG’s from another worker in another VM.

  11. Rohini Basu

    Hi Robert,

    We are starting with Airflow and have a very basic configuration of 1 master and 2 workers.

    Node 1: Webserver, Scheduler, Worker
    Node 2: Worker

    I am facing an issue while running the worker in Node 2.
    [2018-06-25 15:27:03,937: ERROR/MainProcess] consumer: Cannot connect to redis://:**@:/0: Error 111 connecting to Connection refused..
    Trying again in 2.00 seconds…

    I was able to ping the machine where redis is running from Node 2.
    airflow.cfg file is homogenous across these 2 nodes. What am I missing here?

  12. somenzz

    Thank you for your article, but I have a question ,is Queueing Service [ broker_url = amqp://guest:guest@{RABBITMQ_HOST}:5672/]

  13. somenzz

    If airflow scheduler break down, what’s to be done?

  14. Tony


    It’s been a while since this post, hopefully someone can clarify this for me. Can this work only with the Master Node 1 ? What happens if we remove Master Node 2 and Load Balancer? Would it still work?


  15. Steve

    Hi Robert,

    I tried to follow your instructions setting a 2-node architecture (master1, worker1) on GCP (airflow 1.10.1, python3).

    When I run the airflow worker command on worker1, I get an error message

    consumer: Cannot connect to amqp://guest:**@localhost:5672//: Error opening socket: a socket error occurred.
    Trying again in 2.00 seconds…

    I assume the above message indicates that I have not set up broker_url properly. Below are the airflow.cfg configs on master1 and worker1.

    executor = CeleryExecutor
    sql_alchemy_conn = postgresql+psycopg2://db_user_name:db_password@localhost:5432/airflow
    broker_url = amqp://rabbit_user_name:rabbit_password@localhost/
    result_backend = db+postgresql://db_user_name:db_password@localhost:5432/airflow
    fernet_key = XXXXXXXXXXX

    executor = CeleryExecutor
    sql_alchemy_conn = postgresql+psycopg2://db_user_name:db_password@master1_ip:5432/airflow
    broker_url = amqp://rabbit_user_name:rabbit_password@master1_ip:5672/
    result_backend = db+postgresql://db_user_name:db_password@master1_ip:5432/airflow
    fernet_key = XXXXXXXXXXX

    1) Should I specify port on master1 broker_url i.e. broker_url = amqp://rabbit_user_name:rabbit_password@localhost:5672/?
    2) I have created a vhost on rabbitMQ (rabbit_vhost_name), should I use it in both broker_urls
    i.e. master1->broker_url = amqp://rabbit_user_name:rabbit_password@localhost/rabbit_vhost_name
    worker1->broker_url = amqp://rabbit_user_name:rabbit_password@master1_ip:5672/rabbbit_vhost_name
    3) Regarding firewalls, are you aware whether I should allow HTTP and HTTPS traffic on master1? I have defined a firewall rule to open up port 5672 from a selected range on IP. Without allow HTTP and HTTPS I can telnet from worker1 to master1 on port 5672.
    4) Should the fernet_key be exactly the same on master1 and worker1?

    Thanks you.

  16. Gaurav Agnihotri

    getting the error when I run “airflow list_dags: (my all the services are running fine.

    /home/airflow/dags$ airflow list_dags
    Traceback (most recent call last):
    File “/usr/local/bin/airflow”, line 16, in
    from airflow import configuration
    File “/usr/local/lib/python2.7/dist-packages/airflow/”, line 30, in
    from airflow import configuration as conf
    File “/usr/local/lib/python2.7/dist-packages/airflow/”, line 321, in
    File “/usr/local/lib/python2.7/dist-packages/airflow/”, line 310, in mkdir_p
    raise AirflowConfigException(‘Had trouble creating a directory’)
    airflow.exceptions.AirflowConfigException: Had trouble creating a directory

  17. Sivamani

    why we need rabitMq in ai flow cant we run it without rabitmq or any messaging srevice ?

  18. Matt

    Hi Robert, Like the way you narrated about Airflow in your blog. It provided us good insights on how different components work together run time.
    However, I question and that keeps me bugging.

    For example – I have a DAG that is running daily on a fixed schedule. When it runs, this DAG processes and update 10 records in an Oracle table.
    While updating the data, I also want to capture a “batch id” or a “request_id” to identify what batch these 10 records got updated.
    Is there a way like that?

    if you are familiar with Oracle, it has a field called “request_id” and for each run, the concurrent request id of a program gets updated in this field.
    Is there something similar in Airflow?


  19. Matt

    Hi All, This blog provided me good insights on how different components work together run time.
    However, I have a question on batch processing.

    For example – I have a DAG that is running daily on a fixed schedule. When it runs, this DAG processes and update 10 records in an Oracle table.
    While updating the data, I also want to capture a “batch id” or a “request_id” to identify what batch these 10 records got updated.
    Is there a way like that?

    If you are familiar with Oracle, it has a field called “request_id” and for each run, the concurrent request id of a program gets updated in this field.
    Is there something similar in Airflow?

  20. Pingback: Apache Airflow Architecture – Limitless Data Science

Comments are closed.