Fixing an AWS EC2 Instance Boot Up Issue

Background

We recently had a problem with one of our AWS EC2 Instances after shutting it down, making some configuration changes, and starting it back up. We were unable to SSH onto the machine even though it appeared to come up OK (we kept getting a Connection Refused error). We reviewed the Security Group settings and Network Settings, reverted our configuration changes, made sure we were pointing to the correct IP address and much more, but we still couldn’t SSH onto the machine.

Upon viewing the system logs, we noticed that one of the disk volumes had failed to mount. It was an Instance Store drive that, after the restart, had apparently been re-attached under a different device name. This prevented the boot process from completing, which in turn prevented the sshd daemon from starting up to accept our SSH connections. Since we couldn’t SSH onto the machine to effect repairs, we were left dead in the water. We eventually figured out a way to view the file system and make the necessary changes to fix the issue, which is described in this blog post.

In our case it was an issue with /etc/fstab that forced us to follow these steps, but there are other cases where they can benefit you as well. For example, if you mistakenly configured sshd not to start on boot, or if something else failed during boot up and prevented the sshd daemon from starting.

High Level Process

To resolve this, we’re basically going to detach the bad machine’s root volume, attach it to a healthy machine so we can explore the file system and fix the issue, and then re-attach it to the original instance.
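
If you’re more comfortable with the AWS CLI than the Web Console, the same high-level flow can be sketched out like this (the instance IDs and the {VOLUME_ID} below are placeholders; the rest of this post walks through the equivalent Web Console steps in detail):

# Stop the broken instance and detach its root EBS volume
aws ec2 stop-instances --instance-ids {PROD_INSTANCE_ID}
aws ec2 detach-volume --volume-id {VOLUME_ID}

# Attach the volume to a healthy instance as a secondary drive,
# then SSH onto that instance, mount the volume and make your fix
aws ec2 attach-volume --volume-id {VOLUME_ID} --instance-id {MAINTENANCE_INSTANCE_ID} --device /dev/sdf

# Detach the volume and re-attach it to the original instance as its root device
aws ec2 detach-volume --volume-id {VOLUME_ID}
aws ec2 attach-volume --volume-id {VOLUME_ID} --instance-id {PROD_INSTANCE_ID} --device /dev/xvda
aws ec2 start-instances --instance-ids {PROD_INSTANCE_ID}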

Step by Step Process

Setup

Suppose we have our EC2 instance (call it prod-instance) which has booted up OK, but which we’re unable to SSH onto.


Steps

  1. Log in to the AWS Web Console
  2. Stop the prod-instance instance
  3. Detach the root EBS volume from the prod-instance
    1. Select the prod-instance EC2 instance in the AWS console and view the content in the “Description” tab in the window below the instance list
    2. Search for the “Root device” field
    3. Click on the link next to it
      • It should look something like this: /dev/xvda
      • A dialog box will pop up
        Block Device Modal
    4. Take a note of the EBS ID
      • For the steps below, assume the EBS ID is vol-0c7bf2325c6ab485b
    5. Click on the EBS ID link
      • This will take you to a new list with information on that EBS Volume
        Available Volumes
    6. Make sure the EBS Volume vol-0c7bf2325c6ab485b is selected and click Actions -> Detach Volume
      Attached Volume Actions
    7. If you would like to abort this and reattach the volume, jump to step #15
  4. Create a brand new micro instance that you’re able to SSH into and let it start up. We’ll call it maintenance-instance.
    • Make sure that it’s in the same Region and Availability Zone as the machine you detached the root volume from. Volumes cannot move between Availability Zones.
    • Note: Be sure you can SSH onto the machine before proceeding
      ssh -i {pem_file} {username}@{ec2_host_or_ip}
       Prod Instance Stopped
  5. Attach the prod-instance’s old root EBS volume to the maintenance-instance as an additional drive
    1. Click on the “Volumes” link on the left side of the AWS EC2 Web Console under ELASTIC BLOCK STORE
    2. Search for the EBS Volume you detached (vol-0c7bf2325c6ab485b). It will also be listed as having the State “available” (as opposed to “in-use”).
      Volume available
    3. Select the volume and click Actions -> Attach Volume
      Detached Volume Actions
    4. This will open a modal
      Attach Volume
    5. Search for the maintenance-instance and click on the entry
      Instance Added to Attach Volume
      • By clicking on the entry it will put in a default value into the Device field. If it doesn’t, you can put in the value /dev/sdf.
    6. Click Attach
    7. Note: You do not need to stop or restart the maintenance-instance before or after attaching the volume.
  6. SSH onto the maintenance-instance
  7. Login as root
    sudo su
  8. Check the disk to ensure that the prod-instance‘s old root EBS volume is available and get the device name
    1. Run the following command to get information about what volumes are currently mounted (which should only be the default root volume at this point)
      df -h
      • This will produce a result like this:
        Filesystem Size Used Avail Use% Mounted on
        devtmpfs 488M 64K 488M 1% /dev
        tmpfs 498M 0 498M 0% /dev/shm
        /dev/xvda1 7.8G 981M 6.7G 13% /
      • What this tells you is that there is one main drive called /dev/xvda1 which is the root volume of the maintenance-instance. Thus we can ignore this device name.
    2. Run the following command to find out the device name of the volume we want to repair
      lsblk
      • This will produce a result like this:
        NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
        xvda 202:0 0 8G 0 disk
        └─xvda1 202:1 0 8G 0 part /
        xvdf 202:80 0 8G 0 disk
        └─xvdf1 202:81 0 8G 0 part
      • What this tells you is that there are two disks attached, each with one partition. We’ve already determined that the xvda device is the original root volume of the maintenance-instance, so by process of elimination xvdf is the disk we attached to the machine and want to repair.
    3. Get the device name of the volume you mounted onto the machine
      1. In our case, based on the output above, the device name is: /dev/xvdf1 (which is the partition of that disk)
      2. Note: you may have noticed that the device name is also shown in the AWS console under the instance’s Description, in the Block devices section. However, the value shown in the AWS console isn’t always the same as the one reported by fdisk or lsblk, so you shouldn’t rely on it. Use the device name reported by fdisk or lsblk.
  9. Create the directory that you want to mount the volume to (this can be named and placed wherever you would like)
    mkdir /badvolume
  10. Mount the drive’s partition to the directory
    mount /dev/xvdf1 /badvolume
  11. Explore the file system and make the necessary change you would like to it
    • Change directory to the newly mounted file system
      cd /badvolume
    • Note: In our case, since we were dealing with a mounting issue, we had to modify the /etc/fstab file to prevent the machine from trying to mount the volume that was failing. Since the prod-instance’s root volume was mounted onto the /badvolume directory, the fstab file we needed to fix was at /badvolume/etc/fstab.
      • We simply commented out the bad entry and then moved on (an illustrative sketch of this change is shown after these steps)
    • When you have completed your repairs, move onto the next step
  12. Unmount the drive from the machine
    umount /badvolume
  13. Switch back to the AWS Web Console
  14. Detach the vol-0c7bf2325c6ab485b volume from the maintenance-instance
    1. Click on the “Volumes” link on the left side of the AWS Web Console under ELASTIC BLOCK STORE
    2. Search for the EBS Volume you detached (vol-0c7bf2325c6ab485b). It will also be listed as having the State “in-use”.
    3. Select the volume and click Actions -> Detach Volume
      Attached Volume Actions
  15. Re-Attach the vol-0c7bf2325c6ab485b volume to the prod-instance as the root volume
    1. Click on the “Volumes” link on the left side of the AWS Web Console under ELASTIC BLOCK STORE
    2. Search for the EBS Volume you detached (vol-0c7bf2325c6ab485b). It will also be listed as having the State “available”.
    3. Select the volume and click Actions -> Attach Volume
      Detached Volume Actions
    4. This will open a modal
      Attach Volume
    5. Search for the prod-instance and click on the entry
    6. Set the Device as the root volume with the value: /dev/xvda
      Instance Added to Attach Volume
    7. Click Attach
  16. Restart the prod-instance
  17. Test SSH’ing onto the prod-instance
  18. If you’re still having issues connecting to the prod-instance then check the system logs of the machine to debug the problem and, if necessary, repeat these steps to fix the issue with the drive.
  19. When you’re all done you can terminate the maintenance-instance
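
For reference, here’s an illustrative sketch of the kind of /etc/fstab change described in step #11. The device name and mount point below are made up for this example; the important part is that the entry for the volume that was failing to mount ends up commented out (adding the nofail mount option is an alternative way to stop a failed mount from blocking boot):

# /badvolume/etc/fstab
LABEL=/      /                  ext4  defaults,noatime              1  1
# /dev/xvdb  /media/ephemeral0  auto  defaults,comment=cloudconfig  0  2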

Setting up an Apache Airflow Cluster

In one of our previous blog posts, we described the process you should take when Installing and Configuring Apache Airflow.  In this post, we will describe how to setup an Apache Airflow Cluster to run across multiple nodes. This will provide you with more computing power and higher availability for your Apache Airflow instance.

Airflow Daemons

A running instance of Airflow has a number of Daemons that work together to provide the full functionality of Airflow. The daemons include the Web Server, Scheduler, Worker, Kerberos Ticket Renewer, Flower and others. Below are the primary ones you will need to have running for a production-quality Apache Airflow Cluster.

Web Server

A daemon which accepts HTTP requests and allows you to interact with Airflow via a Python Flask Web Application. It provides the ability to pause and unpause DAGs, manually trigger DAGs, view running DAGs, restart failed DAGs and much more.

The Web Server Daemon starts up gunicorn workers to handle requests in parallel. You can scale up the number of gunicorn workers on a single machine to handle more load by updating the ‘workers’ configuration in the {AIRFLOW_HOME}/airflow.cfg file.

Example

workers = 4

Startup Command:

$ airflow webserver
Scheduler

A daemon which periodically polls to determine if any registered DAGs and/or Task Instances need to be triggered based on their schedules.

Startup Command:

$ airflow scheduler
Executors/Workers

A daemon that handles starting up and managing one to many celeryd processes to execute the desired tasks of a particular DAG.

This daemon only needs to be running when you set the ‘executor’ config in the {AIRFLOW_HOME}/airflow.cfg file to ‘CeleryExecutor’. It is recommended to do so for Production.

Example:

executor = CeleryExecutor

Startup Command:

$ airflow worker

How do the Daemons work together?

One thing to note about the Airflow Daemons is that they don’t register with each other or even need to know about each other. Each of them handles its own assigned task, and when all of them are running, everything works as you would expect.

  1. The Scheduler periodically polls to see if any DAGs that are registered in the MetaStore need to be executed. If a particular DAG needs to be triggered (based on the DAG’s schedule), then the Scheduler Daemon creates a DagRun instance in the MetaStore and starts to trigger the individual tasks in the DAG. The Scheduler does this by pushing messages into the Queueing Service. Each message contains information about the Task to be executed, including the DAG Id, Task Id and what function needs to be performed. In the case where the Task is a BashOperator with some bash code, the message will contain this bash code.
  2. A user might also interact with the Web Server and manually trigger DAGs to be run. When a user does this, a DagRun will be created and the Scheduler will start to trigger individual Tasks in the DAG in the same way that was mentioned in #1. (A DagRun can also be created from the Airflow CLI, as the sketch after this list shows.)
  3. The celeryd processes controlled by the Worker daemon will pull from the Queueing Service at regular intervals to see if there are any tasks that need to be executed. When one of the celeryd processes pulls a Task message, it updates the Task instance in the MetaStore to a Running state and tries to execute the code provided. If it succeeds, it updates the state to Succeeded, but if the code fails while being executed, it updates the Task as Failed.
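
As a small illustration of #1 and #2 (example_dag is a hypothetical DAG ID used only for this sketch), a DagRun can also be created from the Airflow CLI instead of the Web Server, and the Scheduler and Workers will pick it up exactly as described above:

$ airflow list_dags                  # confirm the DAG is registered in the MetaStore
$ airflow trigger_dag example_dag    # creates a DagRun; the Scheduler queues its tasks
$ airflow list_tasks example_dag     # list the individual Tasks the Workers will execute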

Single Node Airflow Setup

A simple instance of Apache Airflow involves putting all the services on a single node, as the diagram below depicts.

Apache Airflow Single-Node Cluster

Multi-Node (Cluster) Airflow Setup

A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.

Apache Airflow Multi-Node Cluster

Benefits

Higher Availability

If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.

Distributed Processing

If you have a workflow with several memory-intensive tasks, those tasks will be spread across the worker nodes, allowing for better resource utilization across the cluster and faster execution of the tasks.

Scaling Workers

Horizontally

You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, machines can be turned on and off without any downtime for the cluster.

Vertically

You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.

Example:

celeryd_concurrency = 30

You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and CPU intensity of the tasks you’re running on the cluster.

Scaling Master Nodes

You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon in case there are too many HTTP requests coming for one machine to handle, or if you want to provide Higher Availability for that service.

One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.

If you would like, the Scheduler daemon may also be set up to run on its own dedicated Master Node.

Apache Airflow Multi-Master Node Cluster

Apache Airflow Cluster Setup Steps

Pre-Requisites
  • The following nodes are available with the given host names:
    • master1
      • Will have the role(s): Web Server, Scheduler
    • master2
      • Will have the role(s): Web Server
    • worker1
      • Will have the role(s): Worker
    • worker2
      • Will have the role(s): Worker
  • A Queueing Service is running (RabbitMQ, AWS SQS, etc.)
    • You can install RabbitMQ by following these instructions: Installing RabbitMQ
      • If you’re using RabbitMQ, it is recommended that it also be set up as a cluster for High Availability. Set up a Load Balancer to proxy requests to the RabbitMQ instances.
Steps
  1. Install Apache Airflow on ALL machines that will have a role in the Airflow cluster
  2. Apply Airflow configuration changes to ALL machines by updating the {AIRFLOW_HOME}/airflow.cfg file.
    1. Change the Executor to CeleryExecutor
      executor = CeleryExecutor
    2. Point SQL Alchemy to the MetaStore
      sql_alchemy_conn = mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
    3. Set the Broker URL
      1. If you’re using RabbitMQ:
        broker_url = amqp://guest:guest@{RABBITMQ_HOST}:5672/
      2. If you’re using AWS SQS:
        broker_url = sqs://{ACCESS_KEY_ID}:{SECRET_KEY}@
         
        #Note: You will also need to install boto:
        $ pip install -U boto
    4. Point Celery to the MetaStore
      celery_result_backend = db+mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
  3. Deploy your DAGs/Workflows on master1 and master2 (and any future master nodes you might add)
  4. On master1, initialize the Airflow Database (if not already done after updating the sql_alchemy_conn configuration)
    airflow initdb
  5. On master1, startup the required role(s)
    • Startup Web Server
      $ airflow webserver
    • Startup Scheduler
      $ airflow scheduler
  6. On master2, startup the required role(s)
    • Startup Web Server
      $ airflow webserver
  7. On worker1 and worker2, startup the required role(s)
    • Startup Worker
      $ airflow worker
  8. Create a Load Balancer to balance requests going to the Web Servers
    • If you’re in AWS, you can do this with an Elastic Load Balancer (ELB)
    • If you’re not on AWS, you can use something like haproxy to proxy/balance requests to the Web Servers (a minimal haproxy sketch is shown after these steps)
  9. You’re done!
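
If you do go the haproxy route, a minimal configuration sketch might look like the following. The host names come from the example cluster above, and the port assumes the Web Servers are running on Airflow’s default port of 8080; adjust both to match your environment.

# /etc/haproxy/haproxy.cfg (illustrative sketch)
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend airflow_web
    bind *:8080
    default_backend airflow_webservers

backend airflow_webservers
    balance roundrobin
    server master1 master1:8080 check
    server master2 master2:8080 check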

Additional Documentation

Documentation: https://airflow.incubator.apache.org/

Install Documentation: https://airflow.incubator.apache.org/installation.html

GitHub Repo: https://github.com/apache/incubator-airflow