These days I'm working on a new ETL project and I wanted to give Airflow a try as a job manager. My colleague and I are both working with Airflow for the first time, and we are following two different approaches: I decided to write Python functions (operators like the ones included in the apache-airflow project), while my colleague uses Airflow to call external Python scripts through BashOperator.
I'd like to know if there is something like a "good practice" here: are the two approaches equally good, or should I prefer one over the other?
To me, the main differences are (roughly the two styles sketched below):
- with BashOperator you can call a Python script using a specific Python environment with specific packages
- with BashOperator the tasks are more independent and can be launched manually if Airflow goes mad
- with BashOperator task-to-task communication is a bit harder to manage
- with BashOperator task errors and failures are harder to manage (how can a bash task know whether the task before it failed or succeeded?)
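For concreteness, a minimal sketch of the two styles, assuming Airflow 2.x-style imports and made-up DAG/task ids and paths:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract():
    # approach 1: the task logic is a plain Python function
    print("extracting...")


with DAG("etl_example", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    # approach 1: logic lives in a Python callable run inside the Airflow worker
    extract_py = PythonOperator(task_id="extract_py", python_callable=extract)

    # approach 2: logic lives in an external script, possibly run with its own virtualenv
    extract_sh = BashOperator(
        task_id="extract_sh",
        bash_command="/path/to/venv/bin/python /path/to/extract.py",
    )
```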
What do you think?
Upvotes: 17
Views: 6991
Reputation: 136
TaskA checks data availability at the source. TaskB processes it.
TaskA >> TaskB
Both tasks use BashOperator to call Python scripts. From script1, which is triggered by TaskA, I call sys.exit(1) when there is no data at the source, as a way to communicate that TaskA failed because there is no data and there is no need to run TaskB.
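A minimal sketch of that pattern, assuming made-up script paths and task ids; the exit code of the script is what drives the failure:

```python
# check_source.py -- hypothetical script run by TaskA
import sys


def source_has_data() -> bool:
    # placeholder: replace with a real check against the source system
    return False


if __name__ == "__main__":
    if not source_has_data():
        # a non-zero exit code makes the BashOperator task fail,
        # so TaskB (default trigger_rule "all_success") will not run
        sys.exit(1)
```

```python
# DAG file (Airflow 2.x-style import; paths and ids are illustrative)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("source_check", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    task_a = BashOperator(task_id="TaskA", bash_command="python /path/to/check_source.py")
    task_b = BashOperator(task_id="TaskB", bash_command="python /path/to/process.py")

    task_a >> task_b  # TaskB only runs if TaskA succeeds
```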
Upvotes: 1
Reputation: 6548
My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:
- I keep a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This handles the Python environment you mentioned regarding BashOperator.
- Inside a Python callable you can raise AirflowSkipException to have Airflow mark the task as skipped rather than failed. FYI, for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.
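A minimal sketch of the skip pattern with PythonOperator, assuming Airflow 2.x-style imports and an illustrative availability check:

```python
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator


def check_source():
    has_data = False  # placeholder: replace with a real availability check
    if not has_data:
        # marks this task as skipped; with the default trigger rule,
        # downstream tasks are skipped as well rather than failed
        raise AirflowSkipException("no data at source")


def process():
    print("processing data...")


with DAG("etl_python_operators", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    check = PythonOperator(task_id="check_source", python_callable=check_source)
    load = PythonOperator(task_id="process_data", python_callable=process)

    check >> load
```

This handles the "did the previous task succeed?" question inside Airflow itself, without parsing exit codes in bash.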
Upvotes: 10