Upgrading Apache Airflow Versions

In a previous post we explained how to Install and Configure Apache Airflow (a platform to programmatically author, schedule and monitor workflows). The project is under active development, and new features and bug fixes are added regularly in the form of new releases. At some point, you will want to upgrade to take advantage of these new features.

In this post we’ll go over the process that you should follow for upgrading Apache Airflow versions.

Note: You will separately need to make sure that your DAGs will work on the new version of Airflow.

Upgrade Airflow

Note: These steps can also work to downgrade versions of Airflow

Note: Execute all of this on all the instances in your Airflow Cluster (if you have more than one machine)

  1. Gather information about your current environment and your target setup:
    • Get the Airflow Home directory. Placeholder for this value: {AIRFLOW_HOME}
    • Get the current version of Airflow you are running. Placeholder for this value: {OLD_AIRFLOW_VERSION}
      1. To get this value you can run:
        $ airflow version
    • Get the new version of Airflow you want to run. Placeholder for this value: {NEW_AIRFLOW_VERSION}
    • Are you using SQLite? Placeholder for this value: {USING_SQLITE?}
    • If you’re not using SQLite, get the metastore connection details from the airflow.cfg file (the sql_alchemy_conn and celery_result_backend configurations): database type {AIRFLOW_DB_TYPE}, host name {AIRFLOW_DB_HOST}, database schema name {AIRFLOW_DB_SCHEMA}, username {AIRFLOW_DB_USERNAME}, and password {AIRFLOW_DB_PASSWORD}
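    • For example, you can pull these connection values straight out of the config file (a quick sketch; it simply greps the two settings):
      $ grep -E '^(sql_alchemy_conn|celery_result_backend)' {AIRFLOW_HOME}/airflow.cfg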
  2. Ensure the new version of Airflow you want to Install is Available
    1. Run the following command (don’t forget to include the ‘==’):
      $ pip install airflow==
      • Note: This will throw an error saying that the version is not provided and then show you all the versions available. This is supposed to happen and is a way you can find out what versions are available.
    2. View the list of versions available and make sure the version you want to install ‘{NEW_AIRFLOW_VERSION}’ is available
  3. Shut down all the Airflow Services on the Master and Worker nodes
    1. webserver
      1. gunicorn processes
    2. scheduler
    3. worker – if applicable
      1. celeryd daemons
    4. flower – if applicable
    5. kerberos ticket renewer – if applicable
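    • Note: How you stop these services depends on how you run them. If, for example, you manage them with systemd units (the unit names below are hypothetical; adjust them to your setup), a minimal sketch looks like this:
      $ sudo systemctl stop airflow-webserver airflow-scheduler airflow-worker airflow-flower
      $ ps -ef | grep -E 'airflow|gunicorn|celeryd'    # verify nothing is still running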
  4. Take backups of various components to ensure you can Rollback
    1. Optionally, you can create a directory to house all of these backups. The below steps assume you’re going to create this type of folder and push all your backups to the {AIRFLOW_BACKUP_FOLDER}. But you can just as easily rename the files you want to back up if that’s more convenient.
      • Create the backup folder:
        $ mkdir -p {AIRFLOW_BACKUP_FOLDER}
    2. Backup your Configurations
      • Move the airflow.cfg file to the backup folder:
        $ cd {AIRFLOW_HOME}
        $ mv airflow.cfg {AIRFLOW_BACKUP_FOLDER}
    3. Backup your DAGs
      • Zip up the Airflow DAGs folder and move it to the backup folder:
        $ cd {AIRFLOW_HOME}
        $ zip -r airflow_dags.zip dags
        $ mv airflow_dags.zip {AIRFLOW_BACKUP_FOLDER}
      • Note: You may need to install the zip package
    4. Backup your DB/Metastore
      1. If you’re using SQLite ({USING_SQLITE?}):
        • Move the airflow.db sqlite db to the backup folder:
          $ cd {AIRFLOW_HOME}
          $ mv airflow.db {AIRFLOW_BACKUP_FOLDER}
      2. If you’re using a SQL database like MySQL or PostgreSQL, take a dump of the database.
        • If you’re using MySQL you can use the following command:
          $ mysqldump --host={AIRFLOW_DB_HOST} --user={AIRFLOW_DB_USERNAME} --password={AIRFLOW_DB_PASSWORD} {AIRFLOW_DB_SCHEMA} > {AIRFLOW_BACKUP_FOLDER}/airflow_metastore_backup.sql
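        • If you’re using PostgreSQL, a similar dump would look like this (a sketch; it assumes password authentication via the PGPASSWORD environment variable):
          $ PGPASSWORD={AIRFLOW_DB_PASSWORD} pg_dump --host={AIRFLOW_DB_HOST} --username={AIRFLOW_DB_USERNAME} {AIRFLOW_DB_SCHEMA} > {AIRFLOW_BACKUP_FOLDER}/airflow_metastore_backup.sql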
  5. Upgrade Airflow
    1. Run the following pip commands to install Airflow and the required dependencies:
      $ sudo pip install airflow=={NEW_AIRFLOW_VERSION} --upgrade
      $ sudo pip install airflow[hive]=={NEW_AIRFLOW_VERSION} --upgrade
    2. Note: If you installed additional sub-packages of Airflow you will need to upgrade those too
  6. Regenerate and Update Airflow Configurations
    1. Regenerate a fresh airflow.cfg file (the old one was moved to the backup folder) using the following command:
      $ airflow initdb
      • Note: The reason you want to regenerate the airflow.cfg file is that, between versions of Airflow, new configurations might have been added, or the default values of configurations you don’t otherwise override might have changed.
    2. Remove the generated airflow.db file
      $ cd {AIRFLOW_HOME}
      $ rm airflow.db
    3. If you’re using SQLite, copy the old airflow.db file you backed up back to its original place
      $ cd {AIRFLOW_HOME}
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.db .
    4. Manually copy all of the individual updated configurations from the old airflow.cfg file that you backed up to the new airflow.cfg file
      • Compare the airflow.cfg files (backed up and new one) to determine which configurations you need to copy over. This may include the following configurations:
        • executor
        • sql_alchemy_conn
        • base_url
        • load_examples
        • broker_url
        • celery_result_backend
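      • For example, a simple diff of the two files makes it easy to spot what you changed from the defaults:
        $ diff {AIRFLOW_BACKUP_FOLDER}/airflow.cfg {AIRFLOW_HOME}/airflow.cfg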
    5. Review the airflow.cfg file further to ensure all values are set to the correct value
  7. Upgrade Metastore DB
    • Run the following command:
      $ airflow upgradedb
  8. Restart your Airflow Services
    • The same ones you shut down in step #3
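    • Note: If you run the daemons directly rather than through an init system, the Airflow CLI can start each one in daemon mode (worker and flower only apply to a Celery setup):
      $ airflow webserver -D
      $ airflow scheduler -D
      $ airflow worker -D
      $ airflow flower -D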
  9. Test the upgraded Airflow Instance
    • High Level Checklist:
      • Services start up without errors?
      • DAGs run as expected?
      • Do the plugins you have installed (if any) load and work as expected?
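    • As a quick sanity check, you can also exercise the CLI against the upgraded metastore before relying on the UI ({SOME_DAG_ID}, {SOME_TASK_ID}, and the date below are placeholders for one of your own DAGs):
      $ airflow list_dags
      $ airflow test {SOME_DAG_ID} {SOME_TASK_ID} 2017-01-01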
  10. Once you’re satisfied that everything works, you can optionally delete the {AIRFLOW_BACKUP_FOLDER} folder and its contents

Rollback Airflow

In the event you encountered a problem during the upgrade process and would like to roll back to the version you had before, follow these instructions:

  1. Take note of what step you stopped at in the upgrade process
  2. Stop all the Airflow Services
  3. If you reached step #7 in the upgrade steps above (Step: Upgrade Metastore DB)
    1. Restore the database to the original state
      1. If you’re using SQLite ({USING_SQLITE?})
        1. Delete the airflow.db file that’s there and copy the old airflow.db file from your backup folder to its original place:
          $ cd {AIRFLOW_HOME}
          $ rm airflow.db
          $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.db .
      2. If you’re using a SQL database like MySQL or PostgreSQL, restore the dump of the database
        • If you’re using MySQL you can use the following command:
          $ mysql --host={AIRFLOW_DB_HOST} --user={AIRFLOW_DB_USERNAME} --password={AIRFLOW_DB_PASSWORD} {AIRFLOW_DB_SCHEMA} < {AIRFLOW_BACKUP_FOLDER}/airflow_metastore_backup.sql
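        • If you’re using PostgreSQL, the equivalent restore would look like this (a sketch; it assumes password authentication via the PGPASSWORD environment variable):
          $ PGPASSWORD={AIRFLOW_DB_PASSWORD} psql --host={AIRFLOW_DB_HOST} --username={AIRFLOW_DB_USERNAME} {AIRFLOW_DB_SCHEMA} < {AIRFLOW_BACKUP_FOLDER}/airflow_metastore_backup.sql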
  4. If you reached step #6 in the upgrade steps above (Step: Regenerate and Update Airflow Configurations)
    • Copy the airflow.cfg file that you backed up back to its original place:
      $ cd {AIRFLOW_HOME}
      $ rm airflow.cfg
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.cfg .
  5. If you reached step #5 in the upgrade steps above (Step: Upgrade Airflow)
    • Downgrade Airflow back to the original version:
      $ sudo pip install airflow=={OLD_AIRFLOW_VERSION} --upgrade
      $ sudo pip install airflow[hive]=={OLD_AIRFLOW_VERSION} --upgrade
    • Note: If you installed additional sub-packages of Airflow you will need to downgrade those too
  6. If you reached step #4 in the upgrade steps above (Step: Take backups)
    1. Restore the airflow.cfg file (if you haven’t already done so)
      $ cd {AIRFLOW_HOME}
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.cfg .
    2. If you’re using SQLite ({USING_SQLITE?}), restore the airflow.db file (if you haven’t already done so)
      $ cd {AIRFLOW_HOME}
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.db .
  7. Restart all the Airflow Services
  8. Test the restored Airflow Instance

Continuous Delivery With GoCD

This blog outlines our experience moving one of our projects to a continuous delivery model using GoCD on AWS.

Prior to this implementation our code deployments were manual and on demand. We were looking for an automated way of deploying code to various environments with minimal manual intervention. GoCD has continuous delivery as a first-class concept and provides an intuitive interface to start building CD pipelines. We started off with a quick PoC to validate some of our understanding, and after initial success, we now use GoCD to define all of our deployment/delivery pipelines.

This move forced us to define a comprehensive test suite and workflow that set the criteria for promoting code to different environments. The change also increased our ability to push smaller changes more frequently.

Deployment vs Delivery

It is fairly common to see the terms Continuous Delivery and Continuous Deployment used interchangeably. For some this is an important distinction, and for others it does not matter much.

Continuous Deployment provides the ability to automatically release new features and changes to production as soon as the code is checked in. This typically means there are no process-related gating functions between code being checked in and a release of that software making it to production. The only gating function (a simplification) is whether or not the automated test suite has passed. Any code defect will lead to test failures and will force the deployment to fail and stop, so it is important to write integration tests with maximum scenario coverage in order to move towards continuous deployment. Continuous Delivery, while similar to Continuous Deployment, differs in one major aspect: automation goes as far as the process within an organization allows, and then relies on human or other approval processes to deploy to production. Continuous integration and continuous delivery are prerequisites for continuous deployment. GoCD is a tool which gives us the ability to create pipelines to accomplish continuous delivery.

Now that we have that squared away, let’s look at GoCD and how we can use it in a bit more depth.

Key GoCD Concepts

  • Environment
  • Pipeline
  • Stage
  • Job
  • Task
  • Go Server
  • Go Agent

Each artifact (service) being deployed to various environments in the form of pipelines can form a pipeline group. A pipeline within a pipeline group deploys the artifact to an environment (like DEV or QA). A pipeline consists of various stages; each stage consists of jobs which execute in parallel, and each job consists of tasks which execute sequentially. Pipeline structure can be shared among multiple artifact pipelines using a pipeline template. For example, the QA pipelines of two different applications, ServiceA and ServiceB, can share the same pipeline template. Environment variables and other properties can be shared between pipelines belonging to the same environment in the form of GoCD “environments”. For instance, environment variables created in the GoCD DEV environment are available to all GoCD DEV pipelines.
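
To make the nesting concrete, here is a minimal sketch of how these concepts map onto GoCD’s cruise-config XML (the element names follow GoCD’s configuration reference; the group, pipeline, material URL, and script names are made up for illustration):

<pipelines group="ServiceA">
  <pipeline name="ServiceA-DEV">
    <materials>
      <svn url="http://svn.example.com/ServiceA/trunk" />
    </materials>
    <stage name="Build">
      <jobs>
        <job name="package">
          <tasks>
            <exec command="./build.sh" />
          </tasks>
        </job>
      </jobs>
    </stage>
  </pipeline>
</pipelines>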

Go-Server and Go-Agent are the entities which together provide the ability to form and run pipelines. The Go-Server allows us to create pipelines and maintain the configuration and any additional data that composes our pipelines. Go-Agents take commands from the Go-Server and execute the stages of a pipeline.

Sample Pipeline Workflow

The diagram below depicts a sample workflow created using GoCD. The source control system used is Subversion and the deployment is on AWS, but other deployments are similar and instructions can be found on the https://gocd.io website. This workflow shows deployment through two environments: DEV and QA. Once the code gets checked in, it is picked up by the Build stage of the DEV pipeline, which polls trunk for changes. Four stages are set up for the pipeline in each environment; in DEV they follow the order of build, deployment, integration testing, and build promotion for the next environment. For the QA environment, the artifact promoted in DEV is used as the material (the input to the pipeline).

GoCD Server and Agent Installation on AWS

On the AWS server, go to the location where you want to download the files and run these commands in order:

$ wget https://download.gocd.io/binaries/16.12.0-4352/rpm/go-server-16.12.0-4352.noarch.rpm
$ sudo rpm -i go-server-16.12.0-4352.noarch.rpm
$ wget https://download.gocd.io/binaries/16.12.0-4352/rpm/go-agent-16.12.0-4352.noarch.rpm
$ sudo rpm -i go-agent-16.12.0-4352.noarch.rpm

For simplicity, we are installing go-server and go-agent on the same machine, but they can live on different servers. Once installed, go-server can be accessed at https://{go-server-ip}:8154/go/

GoCD provides a set of commands, located in the init.d folder, to start and stop the server and agents. The go-agent can be controlled like this:

$ /etc/init.d/go-agent [start|stop|status|restart]
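
The go-server daemon is controlled the same way:

$ /etc/init.d/go-server [start|stop|status|restart]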

By default, the Go processes run as the ‘go’ user. After startup, the server can be accessed at these URLs:

https://{go-server-ip}:8154/go [secured]
http://{go-server-ip}:8153/go   [unsecured]

If the server is running, you will see the default pipelines page. Also, the go-agent is registered and can be found at this URL: https://{go-server-ip}:8154/go/agents

If you want to have separate agents addressing different environments, check the agent you want to modify, click environments and make changes.

Pipeline Setup

Access the pipelines admin page: {go-server}:8154/go/admin/pipelines

  • Create a new pipeline group, then add a pipeline within this pipeline group.
  • Add the material information. For the DEV pipeline, this should be one or more SVN (or other version-controlled) artifacts. For downstream pipelines, it should be a pipeline stage from the parent pipeline.

  • Add stages – Build, Deploy, Test, Promote (or different ones, based on your requirements)
  • Within stages, add jobs and then add tasks within each job.
  • A task can be one of the primary task types from the dropdown list, or it can execute shell scripts using the “Script Executor” plugin.

  • From the home page, click Environments and, if none exists, add a new Pipeline Environment using “Add A New Environment“.

  • Make sure the Pipeline Environment has an agent associated with it in order for any pipeline within it to run. Agents can be added via the ‘Add Agents’ tab when you create the environment, or managed later using the ‘Agents’ tab.
  • Once the environment is created, go to the ‘Environments’ tab from the home page and click on the environment you created.

  • An environment can hold environment variables that are shared by all pipelines within it. A pipeline can also have its own environment variables.
  • If the pipeline is meant to be an automatic one, make sure the automatic scheduling option is checked in the pipeline edit page. Leave it unchecked for manual deploys, and add a cron-style timer for time-scheduled deploys. For example, to run a pipeline every day at 8 am, use this inside the pipeline’s timer settings: 0 0 8 * * ? More documentation on using cron timers is here: https://docs.gocd.io/current/configuration/configuration_reference.html#timer
  • A new pipeline is in a paused state by default. Once the pipeline is created, go to the pipelines tab and click the ‘pause’ button to un-pause the pipeline.

Extracting Templates from existing pipelines and Creating pipelines from Template

  • If a pipeline is to be reused, go to the admin/pipelines page and click ‘Extract Template’ on the pipeline you want to extract a template from.

  • To use an extracted template, click “Use Template” when you create a new pipeline instead of defining the stages from scratch.

Plugin Installation

The Script Executor plugin used above can be downloaded from the plugin’s releases page. There are lots of other plugins provided for GoCD as well.

To install the above plugin, download only the jar from the releases and place it here: <go-server-location>/plugins/external, then restart the Go Server.
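
For example, using the default Go Server location listed under “GoCD Directories” below (the jar file name here is a placeholder for whichever release you downloaded):

$ sudo cp script-executor-<version>.jar /var/lib/go-server/plugins/external/
$ sudo /etc/init.d/go-server restart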

GoCD Directories

Listed below are the default locations.

Go agent/server config:
/etc/default/go-agent
/etc/default/go-server
Log path:
/var/log/go-agent
Go Server location:
/var/lib/go-server/

Adding a new user for GoCD server access

There are multiple mechanisms to set up authentication with GoCD. We used file-based authentication.

Steps to set up file based authentication:

  1. Create a ‘.passwd’ file on the GoCD server, say /users/go/.passwd
  2. Add users to this file from the command line like this:
    $ htpasswd -s .passwd username

    The user is prompted for the password. The -s flag forces htpasswd to hash the password with SHA-1, which is what GoCD’s password file authentication expects. If the file does not exist, htpasswd will do nothing except return an error.

  3. Go to the Admin page and click on the Server Configuration tab.
  4. Under the User Management section, find ‘Password File Settings’, enter the path /users/go/.passwd, then click Save.

Managing Disk Space

GoCD usually needs about 1GB of free disk space; when that is not available, GoCD will complain about disk space issues. There are various ways to free up disk space. Below is what we follow:

  • Make sure ‘Clean Working directory’ is checked in all stages (see pipeline->stage->’Stage Settings’)
  • Delete the pipelines folder from /var/lib/go-agent. It is a temporary working directory the go-agent uses to maintain pipeline data.
  • Compress logs (see the sketch after this list).
  • Move the go-server pipeline artifacts to a new location with more disk space (like a different mount location), and update ‘artifactsDir’ in Admin → Config XML to point to this location.
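
For the log compression item above, a minimal sketch (the rotated-log pattern and the 7-day cutoff are our own choices; adjust the path if your logs live elsewhere):

$ find /var/log/go-agent -name "*.log.*" ! -name "*.gz" -mtime +7 -exec gzip {} \;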

Best Practices for Adopting CD

  • Do trunk-based deployments, and promote the same binaries across all environments.
  • When starting a new service or project, develop on the dev branch or a feature branch for the group. Merge with trunk when ready, but use trunk when doing any new deployments after that.
  • Merge your branch to trunk frequently, backed by a set of thorough tests. Update the dev branch from trunk often during development. This reduces the time spent merging with trunk later.
  • Add tests (unit, integration, smoke, regression, etc.); without tests, CD will always put us at risk of promoting builds with bugs.
  • Use manual pipelines/intervention only in rare scenarios. Embrace CD as often as possible.