Upgrading Apache Airflow Versions

In a previous post we explained how to Install and Configure Apache Airflow (a platform to programmatically author, schedule and monitor workflows). The project is under active development, with new features and bug fixes landing in each release. At some point, you will want to upgrade to take advantage of these new features.

In this post we’ll go over the process that you should follow for upgrading Apache Airflow versions.

Note: You will need to separately make sure that your DAGs will be able to work on the new version of Airflow.
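
One quick way to check DAG compatibility (for example, on a test machine that is already running the target version, or as part of step 9 below) is to ask Airflow to parse all of your DAG files and report any import errors. This is a minimal sketch assuming an Airflow 1.x-style CLI and module layout:

  $ # List all DAGs Airflow can parse; parse problems are reported in the output
  $ airflow list_dags

  $ # Or inspect the DagBag import errors directly
  $ python -c "from airflow.models import DagBag; print(DagBag().import_errors)"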

Upgrade Airflow

Note: These steps can also work to downgrade versions of Airflow

Note: Execute all of this on all the instances in your Airflow Cluster (if you have more than one machine)

  1. Gather information about your current environment and your target setup:
    • Get the Airflow Home directory. Placeholder for this value: {AIRFLOW_HOME}
    • Get the current version of Airflow you are running. Placeholder for this value: {OLD_AIRFLOW_VERSION}
      1. To get this value you can run:
        $ airflow version
    • Get the new version of Airflow you want to run. Placeholder for this value: {NEW_AIRFLOW_VERSION}
    • Are you using SQLite? Placeholder for this value: {USING_SQLITE?}
    • If you’re not using SQLite, check the airflow.cfg file (the celery_result_backend and sql_alchemy_conn configurations) for the metastore type {AIRFLOW_DB_TYPE}, host name {AIRFLOW_DB_HOST}, database schema name {AIRFLOW_DB_SCHEMA}, username {AIRFLOW_DB_USERNAME}, and password {AIRFLOW_DB_PASSWORD}
  2. Ensure the new version of Airflow you want to install is available
    1. Run the following command (don’t forget to include the ‘==’):
      $ pip install airflow==
      • Note: This will throw an error saying that the version is not provided and then show you all the versions available. This is expected and is a way to find out which versions are available.
    2. View the list of versions available and make sure the version you want to install ‘{NEW_AIRFLOW_VERSION}’ is available
  3. Shut down all the Airflow Services on the Master and Worker nodes
    1. webserver
      1. gunicorn processes
    2. scheduler
    3. worker – if applicable
      1. celeryd daemons
    4. flower – if applicable
    5. kerberos ticket renewer – if applicable
  4. Take backups of various components to ensure you can roll back
    1. Optionally, you can create a directory to house all of these backups. The below steps assume you’re going to create this type of folder and push all your backups to the {AIRFLOW_BACKUP_FOLDER}. But you can just as easily rename the files you want to back up if that’s more convenient.
      • Create the backup folder:
        $ mkdir -p {AIRFLOW_BACKUP_FOLDER}
    2. Backup your Configurations
      • Move the airflow.cfg file to the backup folder:
        $ cd {AIRFLOW_HOME}
        $ mv airflow.cfg {AIRFLOW_BACKUP_FOLDER}
    3. Backup your DAGs
      • Zip up the Airflow DAGs folder and move it to the backup folder:
        $ cd {AIRFLOW_HOME}
        $ zip -r airflow_dags.zip dags
        $ mv airflow_dags.zip {AIRFLOW_BACKUP_FOLDER}
      • Note: You may need to install the zip package
    4. Backup your DB/Metastore
      1. If you’re using sqlite ({USING_SQLITE?}):
        • Move the airflow.db sqlite db to the backup folder:
          $ cd {AIRFLOW_HOME}
          $ mv airflow.db {AIRFLOW_BACKUP_FOLDER}
      2. If you’re using a SQL database like MySQL or PostgreSQL, take a dump of the database.
        • If you’re using MySQL you can use the following command:
          $ mysqldump --host={AIRFLOW_DB_HOST} --user={AIRFLOW_DB_USERNAME} --password={AIRFLOW_DB_PASSWORD} {AIRFLOW_DB_SCHEMA} > {AIRFLOW_BACKUP_FOLDER}/airflow_metastore_backup.sql
  5. Upgrade Airflow
    1. Run the following pip commands to install Airflow and the required dependencies:
      $ sudo pip install airflow=={NEW_AIRFLOW_VERSION} --upgrade
      $ sudo pip install airflow[hive]=={NEW_AIRFLOW_VERSION} --upgrade
    2. Note: If you installed additional sub-packages of Airflow you will need to upgrade those too
  6. Regenerate and Update Airflow Configurations
    1. Regenerate the airflow.cfg file (you moved the original to the backup folder in step 4) using the following command:
      $ airflow initdb
      • Note: The reason you want to regenerate the airflow.cfg file is that, between versions of Airflow, new configurations might have been added or the default values of existing configurations (ones you don’t need to change from their defaults) might have changed.
    2. Remove the generated airflow.db file
      $ cd {AIRFLOW_HOME}
      $ rm airflow.db
    3. If you’re using sqlite, copy the old airflow.db file you backed up back to the original place
      $ cd {AIRFLOW_HOME}
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.db .
    4. Manually copy all of the individual updated configurations from the old airflow.cfg file that you backed up to the new airflow.cfg file
      • Compare the airflow.cfg files (the backed-up one and the new one) to determine which configurations you need to copy over (a diff sketch is shown after this list). This may include the following configurations:
        • executor
        • sql_alchemy_conn
        • base_url
        • load_examples
        • broker_url
        • celery_result_backend
    5. Review the airflow.cfg file further to ensure all values are set to the correct value
  7. Upgrade Metastore DB
    • Run the following command:
      $ airflow upgradedb
  8. Restart your Airflow Services
    • The same ones you shut down in step #3
  9. Test the upgraded Airflow Instance
    • High Level Checklist:
      • Do the services start up without errors?
      • Do your DAGs run as expected?
      • Do the plugins you have installed (if any) load and work as expected?
  10. Once you’re confident the upgrade succeeded, you can delete the {AIRFLOW_BACKUP_FOLDER} folder and its contents if you want
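
The comparison in step 6.4 can be done with standard command-line tools. A minimal sketch, assuming the backup folder layout used above:

  $ # Show every difference between the backed-up and the regenerated config
  $ diff {AIRFLOW_BACKUP_FOLDER}/airflow.cfg {AIRFLOW_HOME}/airflow.cfg

  $ # Or pull out just the settings you typically need to carry over
  $ grep -E '^(executor|sql_alchemy_conn|base_url|load_examples|broker_url|celery_result_backend)' {AIRFLOW_BACKUP_FOLDER}/airflow.cfg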

Rollback Airflow

In the event you encountered a problem during the upgrade process and would like to roll back to the version you were on before, follow these instructions:

  1. Take note of what step you stopped at in the upgrade process
  2. Stop all the Airflow Services
  3. If you reached step #7 in the upgrade steps above (Step: Upgrade Metastore DB)
    1. Restore the database to the original state
      1. If you’re using sqlite ({USING_SQLITE?})
        1. Delete the airflow.db file that’s there and copy the old airflow.db file from your backup folder to its original place:
          $ cd {AIRFLOW_HOME}
          $ rm airflow.db
          $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.db .
      2. If you’re using a SQL database like MySQL or PostgreSQL, restore the dump of the database
        • If you’re using MySQL you can use the following command:
          $ mysql --host={AIRFLOW_DB_HOST} --user={AIRFLOW_DB_USERNAME} --password={AIRFLOW_DB_PASSWORD} {AIRFLOW_DB_SCHEMA} < {AIRFLOW_BACKUP_FOLDER}/airflow_metastore_backup.sql
  4. If you reached step #6 in the upgrade steps above (Step: Regenerate and Update Airflow Configurations)
    • Copy the airflow.cfg file that you backed up back to its original place:
      $ cd {AIRFLOW_HOME}
      $ rm airflow.cfg
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.cfg .
  5. If you reached step #5 in the upgrade steps above (Step: Upgrade Airflow)
    • Downgrade Airflow back to the original version:
      $ sudo pip install airflow=={OLD_AIRFLOW_VERSION} --upgrade
      $ sudo pip install airflow[hive]=={OLD_AIRFLOW_VERSION} --upgrade
    • Note: If you installed additional sub-packages of Airflow you will need to downgrade those too
  6. If you reached step #4 in the upgrade steps above (Step: Take backups)
    1. Restore the airflow.cfg file (if you haven’t already done so)
      $ cd {AIRFLOW_HOME}
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.cfg .
    2. If you’re using sqlite ({USING_SQLITE?}), restore the airflow.db file (if you haven’t already done so)
      $ cd {AIRFLOW_HOME}
      $ cp {AIRFLOW_BACKUP_FOLDER}/airflow.db .
  7. Restart all the Airflow Services
  8. Test the restored Airflow Instance
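
After the rollback, a quick way to confirm the environment really is back on the old version (a minimal sketch, assuming the Airflow 1.x-style CLI):

  $ # Confirm the installed version matches {OLD_AIRFLOW_VERSION}
  $ airflow version

  $ # Confirm the DAGs still parse against the restored config and metastore
  $ airflow list_dags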

Installing Apache Zeppelin on a Hadoop Cluster

Apache Zeppelin (https://zeppelin.incubator.apache.org/) is a web-based notebook that enables interactive data analytics. You can make data-driven, interactive and collaborative documents with SQL, Scala and more.

This document describes the steps you can take to install Apache Zeppelin on a CentOS 7 Machine.

Steps

Note: Run all the commands as Root

Configure the Environment

Install Maven (If not already done)
cd /tmp/
wget https://archive.apache.org/dist/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
tar xzf apache-maven-3.1.1-bin.tar.gz -C /usr/local
cd /usr/local
ln -s apache-maven-3.1.1 maven
Configure Maven (If not already done)
#Run the following
export M2_HOME=/usr/local/maven
export M2=${M2_HOME}/bin
export PATH=${M2}:${PATH}

Note: If you log in as a different user or log out, these settings will be wiped out, so you won’t be able to run any mvn commands. To prevent this, you can append these export statements to the end of your ~/.bashrc file:

#append the export statements
vi ~/.bashrc
#apply the export statements
source ~/.bashrc
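
To confirm Maven is being picked up from the new location, check the version it reports:

#verify the maven install
mvn -version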


Install NodeJS

Note: Steps referenced from https://nodejs.org/en/download/package-manager/

curl --silent --location https://rpm.nodesource.com/setup_5.x | bash -

yum install -y nodejs
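
You can verify the NodeJS install before moving on:

#verify node and npm are on the PATH
node --version
npm --version
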
Install Dependencies

Note: Used for Zeppelin Web App

yum install -y bzip2 fontconfig

Install Apache Zeppelin

Select the version you would like to install

View the available releases and select the latest:

https://github.com/apache/zeppelin/releases

Override the {APACHE_ZEPPELIN_VERSION} placeholder with the value you would like to use.


Download Apache Zeppelin
cd /opt/
wget https://github.com/apache/zeppelin/archive/{APACHE_ZEPPELIN_VERSION}.zip
unzip {APACHE_ZEPPELIN_VERSION}.zip
ln -s /opt/zeppelin-{APACHE_ZEPPELIN_VERSION-WITHOUT_V_INFRONT} /opt/zeppelin
rm {APACHE_ZEPPELIN_VERSION}.zip
Get Build Variable Values
Get Spark Version

Run the following command

spark-submit --version

Override the {SPARK_VERSION} placeholder with this value.

Example: 1.6.0

Get Hadoop Version

Run the following command

hadoop version

Override the {HADOOP_VERSION} placeholder with this value.

Example: 2.6.0-cdh5.9.0

Take this value and get the major and minor version of Hadoop. Override the {SIMPLE_HADOOP_VERSION} placeholder with this value.

Example: 2.6
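
If you would rather derive these values from the command line, here is a minimal sketch (it assumes the first line of the output looks like “Hadoop 2.6.0-cdh5.9.0”):

#full hadoop version, e.g. 2.6.0-cdh5.9.0
hadoop version | head -n 1 | awk '{print $2}'
#major.minor only, e.g. 2.6
hadoop version | head -n 1 | awk '{print $2}' | cut -d. -f1,2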

Build Apache Zeppelin

Update the below placeholders and run:

cd /opt/zeppelin
mvn clean package -Pspark-{SPARK_VERSION} -Dhadoop.version={HADOOP_VERSION} -Phadoop-{SIMPLE_HADOOP_VERSION} -Pvendor-repo -DskipTests

Note: this process will take a while

 

Configure Apache Zeppelin

Base Zeppelin Configuration
Setup Conf
cd /opt/zeppelin/conf/
cp zeppelin-env.sh.template zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml
Setup Hive Conf
# note: verify that the path to your hive-site.xml is correct
ln -s /etc/hive/conf/hive-site.xml /opt/zeppelin/conf/
Edit zeppelin-env.sh

Uncomment export HADOOP_CONF_DIR
Set it to export HADOOP_CONF_DIR="/etc/hadoop/conf"
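
If you prefer not to edit the file by hand, appending the setting from the shell works as well (a minimal sketch; the later definition takes effect when the script is sourced):

#set the Hadoop config dir for Zeppelin
echo 'export HADOOP_CONF_DIR="/etc/hadoop/conf"' >> /opt/zeppelin/conf/zeppelin-env.sh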

Starting/Stopping Apache Zeppelin

Start Zeppelin
/opt/zeppelin/bin/zeppelin-daemon.sh start
Restart Zeppelin
/opt/zeppelin/bin/zeppelin-daemon.sh restart
Stop Zeppelin
/opt/zeppelin/bin/zeppelin-daemon.sh stop
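The same daemon script typically also reports whether the process is running; a minimal sketch:
Check Zeppelin Status
/opt/zeppelin/bin/zeppelin-daemon.sh status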
Viewing Web UI

Once the Zeppelin process is running, you can view the Web UI by opening a web browser and navigating to:

http://{HOST}:8080/

Note: Network rules will need to allow this communication
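
If the page does not load, a quick check from the Zeppelin host itself can help separate a network/firewall problem from a Zeppelin problem (a minimal sketch):

#expect an HTTP 200 response from the local web server
curl -I http://localhost:8080/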

Runtime Apache Zeppelin Configuration

Further configuration may be needed for certain operations to work

Configure Hive in Zeppelin
  1. Open Cloudera Manager and get the public host name of the machine that has the HiveServer2 role. Identify this as {HIVESERVER2_HOST}
  2. Open the Web UI and click the Interpreter tab
  3. Change the Hive default.url option to: jdbc:hive2://{HIVESERVER2_HOST}:10000
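
Once the interpreter setting is saved, a quick way to confirm the connection is to run a simple statement from a notebook paragraph bound to the Hive interpreter. A minimal sketch (the interpreter prefix may differ between Zeppelin versions):

%hive
show databases;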