user6300061

Reputation:

Apache Airflow Best Practice: (Python)Operators or BashOperators

These days I'm working on a new ETL project, and I wanted to give Airflow a try as the job manager. My colleague and I are both working with Airflow for the first time, and we are following two different approaches: I decided to write Python functions (operators like the ones included in the apache-airflow project), while my colleague uses Airflow to call external Python scripts through BashOperator.

I'd like to know if there is something like a "good practice" here: are the two approaches equally good, or should I prefer one over the other?

To me, the main differences are:

  • with BashOperator you can call a Python script using a specific Python environment with specific packages
  • with BashOperator the tasks are more independent and can be launched manually if Airflow goes mad
  • with BashOperator task-to-task communication is a bit harder to manage
  • with BashOperator task errors and failures are harder to manage (how can a bash task know whether the task before it failed or succeeded?)

What do you think?

Upvotes: 17

Views: 6991

Answers (2)

Vpalakkat

Reputation: 136

Task A checks data availability at the source. Task B processes it.

Task A >> Task B

Both tasks use BashOperator to call Python scripts. I call sys.exit(1) from script1 (triggered by Task A) when there is no data at the source, as a way to signal that Task A failed and that Task B does not need to run.
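
For illustration, a minimal sketch of that pattern. The file names (check_source.py, process_data.py), paths, and DAG arguments are assumptions, not part of the original setup, and the BashOperator import path varies by Airflow version.

    # check_source.py -- hypothetical script behind Task A
    import sys

    def source_has_data():
        # placeholder for the real availability check at the source
        return False

    if __name__ == "__main__":
        if not source_has_data():
            sys.exit(1)  # non-zero exit code makes Airflow mark the task as failed
        sys.exit(0)

The DAG then wires the two scripts together with BashOperators:

    # dag sketch: Task B runs only when Task A's script exits with code 0
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash in Airflow 2.x

    with DAG("source_check_example",
             start_date=datetime(2017, 1, 1),
             schedule_interval="@daily") as dag:
        task_a = BashOperator(task_id="check_source",
                              bash_command="python /path/to/check_source.py")
        task_b = BashOperator(task_id="process_data",
                              bash_command="python /path/to/process_data.py")
        task_a >> task_b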

Upvotes: 1

Daniel Huang

Reputation: 6548

My personal preference in these cases would be to use a PythonOperator over BashOperator. Here's what I do and why:

  • Single repo that contains all my DAGs. This repo also has a setup.py that includes airflow as a dependency, along with anything else my DAGs require. Airflow services are run from a virtualenv that installs these dependencies. This covers the Python environment point you raised about BashOperator.
  • I try to put all Python logic unrelated to Airflow in its own externally packaged Python library. That code should have its own unit tests and also a main entry point, so it can be called from the command line independently of Airflow. This addresses your point about when Airflow goes mad!
  • If the logic is small enough that it doesn't make sense to separate into its own library, I drop it in a utils folder in my DAG repo, with unit tests still of course.
  • Then I call this logic in Airflow with the PythonOperator. The Python callable can easily be unit tested, unlike a BashOperator template script. This also means you can do things like start an Airflow DB session, push multiple values to XCom, etc.
  • Like you mentioned, error handling is a bit easier with Python. You can catch exceptions and check return values easily, and you can choose to mark the task as skipped by raising AirflowSkipException (see the sketch after this list).
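
A minimal sketch of what that can look like, assuming a hypothetical process_data callable and fetch_records helper (the real logic would live in the external library or utils folder described above); the PythonOperator import path and the provide_context argument differ between Airflow 1.x and 2.x:

    # dag sketch: the logic is a plain callable that can be unit tested on its own
    from datetime import datetime

    from airflow import DAG
    from airflow.exceptions import AirflowSkipException
    from airflow.operators.python_operator import PythonOperator  # airflow.operators.python in 2.x

    def fetch_records():
        # placeholder for logic that would live in the external library or utils/
        return []

    def process_data(**context):
        records = fetch_records()
        if not records:
            raise AirflowSkipException("no records to process")  # task ends up skipped, not failed
        # push a value for downstream tasks instead of parsing stdout
        context["ti"].xcom_push(key="record_count", value=len(records))

    with DAG("python_operator_example",
             start_date=datetime(2017, 1, 1),
             schedule_interval="@daily") as dag:
        process = PythonOperator(task_id="process_data",
                                 python_callable=process_data,
                                 provide_context=True)  # not needed in Airflow 2.x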

FYI for BashOperator, if the script exits with an error code, Airflow will mark the task as failed.

Upvotes: 10
