pletnes

Reputation: 479

When to use PythonOperator with a callable vs. BashOperator with "python mytask.py"

I’ve seen Airflow used in some projects recently. I’ve noticed that sometimes people choose the PythonOperator with a callable, and other times the BashOperator with something like "python mytask.py". What’s the rationale for spawning a separate bash process just to run a Python function/script? What are the pros and cons of each approach?

Upvotes: 1

Views: 897

Answers (2)

neverMind

Reputation: 1784

I have a similar doubt. I have a GitLab CI/CD pipeline in which I compile, install requirements, and activate the venv through Ansible tasks. My main.py is then executed across several similar DAGs via BashOperator, running on the venv activated above. We did this because it seems that PythonVirtualenvOperator creates a venv and compiles its requirements every single time, for each run of every DAG I have, which seems really inefficient and undesirable.

I was planning to get rid of the BashOperator:

bash_task = BashOperator(
    task_id='gdp_by_group',
    bash_command="/opt/etl/gdp/venv/bin/python3 " + gdp_cmd + " --g {{ params.DATASET_GROUP }}"
)

to something like this:

python_task = PythonVirtualenvOperator(
    task_id='gdp_by_group',
    python_callable=gdp_main.run_gdp_main,  # Update with the correct function name
    requirements=["/path/to/requirements.txt"],  # Update with your requirements file
    system_site_packages=False,
    op_kwargs={'g': '{{ params.DATASET_GROUP }}'}
)

but I also feel this is not the best reason to replace one with the other. What are your thoughts on PythonVirtualenvOperator (or the matching @task.virtualenv decorator) vs. a BashOperator calling Python through the venv with my .py script? Am I thinking about this wrong?
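One thing worth checking (my suggestion, not from the thread, and it assumes Airflow 2.4+): ExternalPythonOperator runs the callable inside a pre-existing virtualenv instead of building one per run, which is essentially the BashOperator approach above expressed as a Python operator. A sketch, reusing the hypothetical paths and names from the snippets above:

```python
from airflow.operators.python import ExternalPythonOperator

# Sketch only: point at the venv already built by the CI/CD pipeline,
# so no requirements are compiled per DAG run (unlike PythonVirtualenvOperator).
python_task = ExternalPythonOperator(
    task_id='gdp_by_group',
    python="/opt/etl/gdp/venv/bin/python3",   # the same pre-built interpreter the bash task used
    python_callable=gdp_main.run_gdp_main,    # hypothetical function name, as above
    op_kwargs={'g': '{{ params.DATASET_GROUP }}'},
)
```

That would keep the "build once in CI/CD, reuse at run time" property you already have, while still getting the nicer parameter passing of a Python operator.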

Upvotes: -1

Lucas M. Uriarte

Reputation: 3121

The script is probably calling a main function or something similar. In that case, from my point of view, you should always run your Python functions from a PythonOperator.

Just as it doesn't make much sense to run a bash command from a PythonOperator (for example, using os.system(command) as the callable), it doesn't make much sense to execute a script containing a Python function from a BashOperator.

If you want to execute a Python function defined somewhere outside the dags folder parsed by Airflow, you can simply import it as a module, as long as it's accessible on your PYTHONPATH. You can always add the path where the script containing the function lives using sys.path.append('my_path').
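As a runnable illustration of that trick (the directory, module, and function names here are made up for the example; in a real DAG the last line would instead hand the imported function to a PythonOperator):

```python
import os
import sys
import tempfile

# Hypothetical setup: a script directory outside the dags/ folder,
# containing gdp_main.py with the function we want to run.
scripts_dir = tempfile.mkdtemp()
with open(os.path.join(scripts_dir, "gdp_main.py"), "w") as f:
    f.write("def run_gdp_main(g):\n    return 'processed group ' + g\n")

# The trick from the answer: put the directory on the path at runtime...
sys.path.append(scripts_dir)
from gdp_main import run_gdp_main  # ...and import it like any module

# In a DAG you would now pass the function itself, e.g.:
# PythonOperator(task_id='gdp_by_group', python_callable=run_gdp_main,
#                op_kwargs={'g': '{{ params.DATASET_GROUP }}'})
print(run_gdp_main("oecd"))
```

This keeps the business logic importable and testable on its own, with Airflow only wiring it into the DAG.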

For more info about the PythonOperator: https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html

Upvotes: 2
