Reputation: 881
I'm thinking of starting to use Apache Airflow for a project and am wondering how people manage continuous integration and dependencies with Airflow. More specifically, say I have the following setup:
Three Airflow servers: dev, staging, and production.
I have two Python DAGs whose source code I want to keep in separate repos. The DAGs themselves are simple: they basically just use a PythonOperator to call main(*args, **kwargs) (see the sketch after the list below). However, the actual code that main runs is very large and stretches across several files/modules. Each Python code base has different dependencies; for example,
Dag1 uses Python 2.7, pandas==0.18.1, requests==2.13.0
Dag2 uses Python 3.6, pandas==0.20.0, and Numba==0.27, as well as some Cythonized code that needs to be compiled
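To make the setup concrete, here is a minimal sketch of the kind of thin DAG I mean (the DAG id, package name, and scheduling details are just placeholders): the DAG file itself does almost nothing and delegates to main() from the separately versioned code base.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical package living in its own repo; the DAG file only imports it.
from dag1_package import main

default_args = {"owner": "airflow", "start_date": datetime(2017, 1, 1)}

with DAG("dag1", default_args=default_args, schedule_interval="@daily") as dag:
    run_main = PythonOperator(
        task_id="run_main",
        python_callable=main,   # all the heavy lifting happens inside main()
        op_args=[],
        op_kwargs={},
    )
```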
How do I manage Airflow running these two DAGs with completely different dependencies? Also, how do I manage the continuous integration of the code for both of these DAGs into each Airflow environment (dev, staging, prod)? Do I just get Jenkins or something to SSH to the Airflow server and run something like git pull origin BRANCH?
Hopefully this question isn't too vague and people can see the problems I'm having.
Upvotes: 7
Views: 2167
Reputation: 1649
We use Docker to run the code with its different dependencies, together with the DockerOperator in the Airflow DAG, which can run Docker containers, including on remote machines (with a Docker daemon already running). We actually have only one Airflow server to run jobs, but several machines with a Docker daemon running, which the Airflow executors call.
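As a rough sketch of that pattern (not our exact DAG; the image name, registry, and remote docker_url are placeholders), each code base is baked into its own image with its own Python version and pinned packages, and the task just points the DockerOperator at that image:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

default_args = {"owner": "airflow", "start_date": datetime(2017, 1, 1)}

with DAG("dag1_docker", default_args=default_args, schedule_interval="@daily") as dag:
    run_dag1 = DockerOperator(
        task_id="run_dag1",
        # Image built from the Dag1 repo, containing Python 2.7,
        # pandas==0.18.1 and requests==2.13.0 (registry path is illustrative).
        image="registry.example.com/dag1:latest",
        command="python main.py",
        # Point at a remote Docker daemon; omit to use the local socket.
        docker_url="tcp://worker-host:2375",
    )
```

Dag2 gets its own image (Python 3.6, pandas==0.20.0, Numba, and the compiled Cython extensions built at image build time), so the two DAGs never share an environment on the Airflow server itself.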
For continuous integration we use GitLab CI with the GitLab Container Registry for each repository: each pipeline builds and pushes the image for its repo, and the DAG simply references the resulting tag. This should be easily doable with Jenkins as well.
Upvotes: 3