Dheemanth Bhat

Reputation: 4452

Airflow: Importing decorated Task vs all tasks in a single DAG file?

I recently started using Apache Airflow and one of its newer concepts, the TaskFlow API. I have a DAG with multiple decorated tasks, where each task has 50+ lines of code, so I decided to move each task into a separate file.

After referring to Stack Overflow, I managed to move each task in the DAG into its own file. Now, my questions are:

  1. Do both of the code samples shown below work the same? (I am worried about the scope of the tasks.)
  2. How will the tasks share data between them?
  3. Is there any difference in performance? (I read that SubDAGs are discouraged due to performance issues; this is not a SubDAG, but I am still concerned.)

All the code samples I see on the web (and in the official documentation) put all the tasks in a single file.

Sample 1

import logging
from airflow.decorators import dag, task
from datetime import datetime

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
    # Task 1
    @task()
    def Task_A():
        logging.info("Task A: Received param None")
        # Some 100 lines of code
        return "A"

    # Task 2
    @task()
    def Task_B(a):
        logging.info(f"Task B: Received param {a}")
        # Some 100 lines of code
        return str(a + "B")

    a = Task_A()
    ab = Task_B(a)

No_Import_Tasks = No_Import_Tasks()

Sample 2

Folder structure:

- dags
    - tasks
        - Task_A.py
        - Task_B.py
    - Main_DAG.py

File Task_A.py

import logging
from airflow.decorators import task

@task()
def Task_A():
    logging.info("Task A: Received param None")
    # Some 100 lines of code
    return "A"

File Task_B.py

import logging
from airflow.decorators import task

@task()
def Task_B(a):
    logging.info(f"Task B: Received param {a}")
    # Some 100 lines of code
    return str(a + "B")

File Main_DAG.py

from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
    a = Task_A()
    ab = Task_B(a)

Import_Tasks_dag = Import_Tasks()

Thanks in advance!

Upvotes: 5

Views: 3896

Answers (1)

Jarek Potiuk

Reputation: 20097

  1. There is virtually no difference between the two approaches, neither from a logic nor a performance point of view.

  2. Tasks in Airflow share data between them using XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html), effectively exchanging data via the database (or other external storage). Two tasks in Airflow - regardless of whether they are defined in one file or many - can be executed on completely different machines (there is no task affinity in Airflow; each task execution is totally separate from the others). So it does not matter, again, whether they are in one or many Python files.

  3. Performance should be similar. Splitting into several files might be marginally slower, but the difference should be totally negligible and possibly not present at all; it depends on your deployment and how you distribute files, but I cannot imagine it having any observable impact.
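For intuition on point 2, the XCom exchange can be sketched in plain Python. This is a toy illustration only, not Airflow internals: the `xcom_store` dict, `push_xcom`, and `pull_xcom` are hypothetical stand-ins for Airflow's metadata database, which persists the return value of each decorated task and hands it to downstream tasks.

```python
# Toy illustration (NOT Airflow internals): TaskFlow-style tasks exchange
# data through an XCom-like key/value store rather than in-process variables.
# Airflow persists XComs in its metadata DB; here a plain dict stands in.

xcom_store = {}  # hypothetical store: maps (task_id, key) -> value

def push_xcom(task_id, value, key="return_value"):
    """Record a task's return value, as Airflow does for decorated tasks."""
    xcom_store[(task_id, key)] = value

def pull_xcom(task_id, key="return_value"):
    """Fetch a value previously pushed by an upstream task."""
    return xcom_store[(task_id, key)]

def task_a():
    return "A"

def task_b(a):
    return a + "B"

# Simulate the scheduler running Task_A first, then Task_B - possibly on a
# completely different worker, since only the store is shared between them:
push_xcom("Task_A", task_a())
result = task_b(pull_xcom("Task_A"))
push_xcom("Task_B", result)

print(pull_xcom("Task_B"))  # -> AB
```

Because the hand-off goes through the store rather than through Python variables, it makes no difference whether `task_a` and `task_b` live in the same file or in separate modules.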

Upvotes: 4
