nono

Reputation: 2441

Airflow structure/organization of DAGs and tasks

My questions:

Upvotes: 39

Views: 37890

Answers (3)

p13rr0m

Reputation: 1297

I am using Google Cloud Composer. I have to manage multiple projects with some additional SQL scripts, and I want to sync everything via gsutil rsync. Hence I use the following structure:

├───dags
│   │
│   ├───project_1
│       │
│       ├───dag_bag.py
│       │
│       ├───.airflowignore
│       │
│       ├───dag_1
│       │      dag.py
│       │      script.sql
│
├───plugins
│   │
│   ├───hooks
│   │      hook_1.py
│   │   
│   ├───sensors
│   │      sensor_1.py
│   │
│   ├───operators
│   │      operator_1.py

And the file dag_bag.py contains these lines:

from airflow.models import DagBag

dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags/project_1", include_examples=False)
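
Building the DagBag alone will not make the scheduler register the DAGs found in the sub-folders; the usual companion to this pattern is to re-export every collected DAG into the module's global namespace, so this one file is all the scheduler has to parse while the .airflowignore keeps it from parsing the sub-folders directly. The loop below is my addition as a sketch, not part of the answer's snippet:

from airflow.models import DagBag

# Collect all DAGs defined under the project sub-folder.
dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags/project_1", include_examples=False)

# Re-export them at module level so the scheduler, which only parses this file,
# actually registers them.
for dag_id, dag in dag_bag.dags.items():
    globals()[dag_id] = dag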

Upvotes: 1

trench

Reputation: 5355

I use something like this.

  • A project is normally something completely separate or unique, e.g. DAGs to process files that we receive from a certain client, which will be completely unrelated to everything else (almost certainly a separate database schema)
  • I have my operators, hooks, and some helper scripts (delete all Airflow data for a certain DAG, etc.) in a common folder
  • I used to have a single git repository for the entire Airflow folder, but now I have a separate git repository per project (it makes things more organized and it is easier to grant permissions on GitLab since the projects are so unrelated). This means that each project folder also has a .git and .gitignore, etc. as well
  • I tend to save the raw data and then 'rest' a modified copy of the data which is exactly what gets copied into the database. I have to heavily modify some of the raw data due to different formats from different clients (Excel, web scraping, HTML email scraping, flat files, queries from SalesForce or other database sources...)

Example tree:

├───dags
│   ├───common
│   │   ├───hooks
│   │   │       pysftp_hook.py
│   │   │
│   │   ├───operators
│   │   │       docker_sftp.py
│   │   │       postgres_templated_operator.py
│   │   │
│   │   └───scripts
│   │           delete.py
│   │
│   ├───project_1
│   │   │   dag_1.py
│   │   │   dag_2.py
│   │   │
│   │   └───sql
│   │           dim.sql
│   │           fact.sql
│   │           select.sql
│   │           update.sql
│   │           view.sql
│   │
│   └───project_2
│       │   dag_1.py
│       │   dag_2.py
│       │
│       └───sql
│               dim.sql
│               fact.sql
│               select.sql
│               update.sql
│               view.sql
│
└───data
    ├───project_1
    │   ├───modified
    │   │       file_20180101.csv
    │   │       file_20180102.csv
    │   │
    │   └───raw
    │           file_20180101.csv
    │           file_20180102.csv
    │
    └───project_2
        ├───modified
        │       file_20180101.csv
        │       file_20180102.csv
        │
        └───raw
                file_20180101.csv
                file_20180102.csv
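
Since Airflow puts the dags folder itself on the Python path, a DAG file inside project_1 can import the shared code from the common folder directly. A minimal sketch (the PostgresTemplatedOperator name comes from the tree above, but its constructor arguments and the connection id are assumptions, not from the answer):

# dags/project_1/dag_1.py
from datetime import datetime

from airflow import DAG

# dags/ is on sys.path, so the shared package is importable from project DAGs.
from common.operators.postgres_templated_operator import PostgresTemplatedOperator

with DAG(
    dag_id="project_1_dag_1",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The templated SQL lives next to the DAG in project_1/sql/.
    load_fact = PostgresTemplatedOperator(
        task_id="load_fact",
        sql="sql/fact.sql",            # assumed parameter name
        postgres_conn_id="warehouse",  # assumed connection id
    )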

Update October 2021. I have a single repository for all projects now. All of my transformation scripts are in the plugins folder (which also contains hooks and operators - basically any code which I import into my DAGs). DAG code I try to keep pretty bare so it basically just dictates the schedules and where data is loaded to and from.

├───dags
│   │
│   ├───project_1
│   │     dag_1.py
│   │     dag_2.py
│   │
│   └───project_2
│         dag_1.py
│         dag_2.py
│
├───plugins
│   ├───hooks
│   │      pysftp_hook.py
│   │      servicenow_hook.py
│   │
│   ├───sensors
│   │      ftp_sensor.py
│   │      sql_sensor.py
│   │
│   ├───operators
│   │      servicenow_to_azure_blob_operator.py
│   │      postgres_templated_operator.py
│   │
│   └───scripts
│       ├───project_1
│       │      transform_cases.py
│       │      common.py
│       ├───project_2
│       │      transform_surveys.py
│       │      common.py
│       └───common
│              helper.py
│              dataset_writer.py
│
├───.airflowignore
├───Dockerfile
└───docker-stack-airflow.yml
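
A "bare" DAG in this layout then only declares the schedule and hands the actual work to the code under plugins. A minimal sketch, assuming a transform_cases.run entry point and dag_id that the answer does not spell out (the plugins folder is on Airflow's Python path by default):

# dags/project_1/dag_1.py
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# plugins/ is on sys.path, so per-project transformation code imports directly.
from scripts.project_1 import transform_cases

with DAG(
    dag_id="project_1_cases",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The DAG only wires up scheduling and where data moves; all real logic
    # stays in plugins/scripts/project_1/transform_cases.py.
    transform = PythonOperator(
        task_id="transform_cases",
        python_callable=transform_cases.run,  # assumed entry point
    )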

Upvotes: 54

fernandosjp

Reputation: 2778

I would love to benchmark folder structures with other people as well. It will probably depend on what you are using Airflow for, but I will share my case. I am building data pipelines for a data warehouse, so at a high level I basically have two steps:

  1. Dump a lot of data into a data-lake (directly accessible only to a few people)
  2. Load data from the data lake into an analytics database, where the data will be modeled and exposed to dashboard applications (many SQL queries to model the data)

Today I organize the files into three main folders that try to reflect the logic above:

├── dags
│   ├── dag_1.py
│   └── dag_2.py
├── data-lake
│   ├── data-source-1
│   └── data-source-2
└── dw
    ├── cubes
    │   ├── cube_1.sql
    │   └── cube_2.sql
    ├── dims
    │   ├── dim_1.sql
    │   └── dim_2.sql
    └── facts
        ├── fact_1.sql
        └── fact_2.sql

This is more or less my basic folder structure.
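
For example, one of the DAGs under dags/ could template and run the SQL files from the dw folder. A minimal sketch, assuming a Postgres analytics database, a connection called analytics_db, and that dw/ is available at /opt/airflow/dw (none of which the answer states):

# dags/dag_1.py
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="build_dw",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Let Jinja find SQL files that live outside the dags/ folder.
    template_searchpath=["/opt/airflow/dw"],
) as dag:
    load_dim = PostgresOperator(
        task_id="load_dim_1",
        postgres_conn_id="analytics_db",
        sql="dims/dim_1.sql",
    )
    load_fact = PostgresOperator(
        task_id="load_fact_1",
        postgres_conn_id="analytics_db",
        sql="facts/fact_1.sql",
    )

    # Dimensions are built before facts.
    load_dim >> load_fact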

Upvotes: 21
