I am using Google Cloud Composer. I have to manage multiple projects, each with some additional SQL scripts, and I want to sync everything via gsutil rsync.
Hence I use the following structure:
├───dags
│   │
│   ├───project_1
│   │   │   dag_bag.py
│   │   │   .airflowignore
│   │   │
│   │   ├───dag_1
│   │   │       dag.py
│   │   │       script.sql
│
├───plugins
│   │
│   ├───hooks
│   │       hook_1.py
│   │
│   ├───sensors
│   │       sensor_1.py
│   │
│   ├───operators
│   │       operator_1.py
And the file dag_bag.py contains these lines:
from airflow.models import DagBag
dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags/project_1", include_examples=False)
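Note that DagBag on its own only collects the DAGs from that folder; for the scheduler to actually register them, the collected DAG objects usually have to be re-exported at module level. A minimal sketch of such a dag_bag.py under that assumption (the globals() loop is not part of the original snippet):

from airflow.models import DagBag

dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags/project_1", include_examples=False)

# Re-export every DAG collected by the DagBag so the scheduler can see it.
for dag_id, dag in dag_bag.dags.items():
    globals()[dag_id] = dag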
Upvotes: 1
I use something like this.
Example tree:
├───dags
│   ├───common
│   │   ├───hooks
│   │   │       pysftp_hook.py
│   │   │
│   │   ├───operators
│   │   │       docker_sftp.py
│   │   │       postgres_templated_operator.py
│   │   │
│   │   └───scripts
│   │           delete.py
│   │
│   ├───project_1
│   │   │   dag_1.py
│   │   │   dag_2.py
│   │   │
│   │   └───sql
│   │           dim.sql
│   │           fact.sql
│   │           select.sql
│   │           update.sql
│   │           view.sql
│   │
│   └───project_2
│       │   dag_1.py
│       │   dag_2.py
│       │
│       └───sql
│               dim.sql
│               fact.sql
│               select.sql
│               update.sql
│               view.sql
│
└───data
    ├───project_1
    │   ├───modified
    │   │       file_20180101.csv
    │   │       file_20180102.csv
    │   │
    │   └───raw
    │           file_20180101.csv
    │           file_20180102.csv
    │
    └───project_2
        ├───modified
        │       file_20180101.csv
        │       file_20180102.csv
        │
        └───raw
                file_20180101.csv
                file_20180102.csv
Update, October 2021: I now have a single repository for all projects. All of my transformation scripts live in the plugins folder (which also contains hooks and operators, basically any code I import into my DAGs). I try to keep the DAG code itself pretty bare, so it basically just dictates the schedule and where data is loaded to and from; a bare DAG in this style is sketched after the tree below.
├───dags
│   │
│   ├───project_1
│   │       dag_1.py
│   │       dag_2.py
│   │
│   └───project_2
│           dag_1.py
│           dag_2.py
│
├───plugins
│   ├───hooks
│   │       pysftp_hook.py
│   │       servicenow_hook.py
│   │
│   ├───sensors
│   │       ftp_sensor.py
│   │       sql_sensor.py
│   │
│   ├───operators
│   │       servicenow_to_azure_blob_operator.py
│   │       postgres_templated_operator.py
│   │
│   ├───scripts
│   ├───project_1
│   │       transform_cases.py
│   │       common.py
│   ├───project_2
│   │       transform_surveys.py
│   │       common.py
│   ├───common
│           helper.py
│           dataset_writer.py
│   .airflowignore
│   Dockerfile
│   docker-stack-airflow.yml
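To illustrate the "bare" DAG idea, here is a minimal sketch of what a file like dags/project_1/dag_1.py could look like under this layout. The DAG id, schedule, and import path are hypothetical, and it assumes the plugins folder (where project_1/transform_cases.py lives) is importable on the scheduler's Python path:

# Illustrative sketch only; names and the import path are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# The transformation logic itself lives in the plugins folder
# (plugins/project_1/transform_cases.py in the tree above).
from project_1.transform_cases import transform_cases

with DAG(
    dag_id="project_1_transform_cases",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="transform_cases",
        python_callable=transform_cases,
    )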
Upvotes: 54
I would love to compare folder structures with other people as well. It probably depends on what you are using Airflow for, but I will share my case. I am building data pipelines for a data warehouse, so at a high level I basically have two steps: land the source data in a data lake, and then transform it into the data warehouse (dimensions, facts and cubes).
Today I organize the files into three main folders that try to reflect the logic above:
├── dags
│   ├── dag_1.py
│   └── dag_2.py
├── data-lake
│   ├── data-source-1
│   └── data-source-2
└── dw
    ├── cubes
    │   ├── cube_1.sql
    │   └── cube_2.sql
    ├── dims
    │   ├── dim_1.sql
    │   └── dim_2.sql
    └── facts
        ├── fact_1.sql
        └── fact_2.sql
This is more or less my basic folder structure.
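As an illustration of how the dw SQL files can be wired into a DAG, here is a minimal sketch using Airflow's template_searchpath so the .sql files are rendered from that folder. The Postgres backend, the connection id, and the deployed path of the dw folder are assumptions, not part of the answer:

# Illustrative sketch; connection id, paths and the Postgres backend are assumed.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="dw_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    template_searchpath="/opt/airflow/dw",  # wherever the dw folder is deployed
) as dag:
    load_dim_1 = PostgresOperator(
        task_id="load_dim_1",
        postgres_conn_id="dw_postgres",  # hypothetical connection
        sql="dims/dim_1.sql",            # resolved against template_searchpath
    )
    load_fact_1 = PostgresOperator(
        task_id="load_fact_1",
        postgres_conn_id="dw_postgres",
        sql="facts/fact_1.sql",
    )
    load_dim_1 >> load_fact_1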
Upvotes: 21