Apache Airflow can be used to create, schedule, and monitor workflows, and is commonly used to define ETL processes. An excellent example of an ETL workflow can be found here.
Apache Airflow can be quickly and easily deployed to your own Heroku app by using this Heroku Button:
You will be prompted for a new Fernet key, which can be generated thusly:
```bash
dd if=/dev/urandom bs=32 count=1 2>/dev/null | openssl base64
```
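If you'd rather not shell out to `dd` and `openssl`, the `cryptography` package (the same library Airflow uses for Fernet) can generate an equivalent key; a minimal sketch:

```python
# Generate a Fernet key with the same library Airflow uses internally.
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())  # 32 random bytes, url-safe base64-encoded
```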
After deployment, a login user will need to be created. This can be done using the `create_user` command through Heroku bash (documentation):

```bash
heroku run bash
airflow create_user -u <username> -p <password> -r <Role> -f <FirstName> -l <LastName> -e <Email>
```
This is based largely on an excellent article (here) on deploying Apache Airflow onto the Heroku platform, with some minor updates and tweaks.
Install or set up a supported Python version (I'm using pyenv, so I just set the desired version in the project directory):

```bash
echo "3.6.4" > .python-version
```
Create a Python virtual environment in which to install Airflow and its dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
```
Install Airflow with the `postgres` and `password` extras, install the `cryptography` module, and freeze the dependencies into `requirements.txt`:

```bash
pip install "apache-airflow[postgres, password]"
pip install "cryptography"
pip freeze > requirements.txt
```
Create a `.gitignore` file:

```bash
echo ".venv/" > .gitignore
```
Initialize the git repository and create the Heroku app with a Postgres add-on:

```bash
git init
git add .
git commit -m "initial commit"
heroku create
heroku addons:create heroku-postgresql:hobby-dev
```
We will use `airflow.cfg` for most of our application configuration, but any secure values should be kept as Heroku config variables. The `airflow.cfg` in this repository is already making use of the `DATABASE_URL` that was assigned when we created the database, but we will need a Fernet key in order to enable encryption for connection passwords stored in the database. You can generate/set one thusly:

```bash
heroku config:set AIRFLOW__CORE__FERNET_KEY=`dd if=/dev/urandom bs=32 count=1 2>/dev/null | openssl base64`
```
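For intuition, the Fernet key enables symmetric encryption of secrets at rest. The sketch below shows the round trip using the `cryptography` package directly; Airflow does the equivalent internally when storing connection passwords, so this is illustrative rather than Airflow's actual code:

```python
# Illustrative only: what a Fernet key makes possible.
from cryptography.fernet import Fernet

key = Fernet.generate_key()                     # stands in for AIRFLOW__CORE__FERNET_KEY
cipher = Fernet(key)

token = cipher.encrypt(b"connection-password")  # ciphertext is what lands in the database
print(cipher.decrypt(token).decode())           # readable only with the same key
```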
We'll also need to set `AIRFLOW_HOME` to `/app` so that Airflow knows where the `airflow.cfg` file is. Otherwise, when the database initializes, it will do so using SQLite on Heroku's ephemeral filesystem, which lasts only as long as the dyno running it:

```bash
heroku config:set AIRFLOW_HOME=/app
```
Heroku uses a `Procfile`, a text file that declares the command used to start the app. For our initial run we just want to initialize the database, so that's what goes in our `Procfile`:

```bash
echo "web: airflow initdb" > Procfile
```
Commit once more and deploy to Heroku. This will build the project on Heroku and run the database initialization command from the `Procfile`:

```bash
git add .
git commit -m "Added configuration files."
git push heroku master
```
Once deployed, follow the log output and await completion of the database initialization:

```bash
heroku logs --tail
```
Now that the database is initialized, update the `Procfile` to launch the web server:

```bash
echo "web: airflow webserver --port \$PORT" > Procfile
```
```bash
git add .
git commit -m "Modify procfile to launch webserver"
git push heroku master
```
Now when you launch the app (`heroku open`) there should be a login screen. No users exist yet, so we need to create one. As above, this can be done using the `create_user` command through Heroku bash (documentation), where `<Role>` is one of Airflow's built-in roles such as `Admin`:

```bash
heroku run bash
airflow create_user -u <username> -p <password> -r <Role> -f <FirstName> -l <LastName> -e <Email>
```
Finally, modify the `Procfile` one last time to run both the web server and the scheduler:

```bash
echo "web: airflow webserver --port \$PORT --daemon & airflow scheduler" > Procfile
```
Any DAGs you want to run can go in a `dags` subfolder within the project; a minimal sketch of one follows.
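The example below is written against the Airflow 1.x API this guide uses; the file name, `dag_id`, and task are all hypothetical placeholders:

```python
# dags/example_dag.py -- a minimal, illustrative DAG (all names are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_hello",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,               # don't backfill runs since start_date
)

say_hello = BashOperator(
    task_id="say_hello",
    bash_command="echo hello from Heroku",
    dag=dag,
)
```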