Reputation: 716
Do you guys have any recommendations for a Composer folder/directory structure? The way it should be structured differs from the way our internal Airflow server is set up right now.
Based on the Google documentation (https://cloud.google.com/composer/docs/concepts/cloud-storage):
plugins/: stores your custom plugins, operators, and hooks.
dags/: stores DAGs and any data the web server needs to parse a DAG.
data/: stores the data that tasks produce and use.
This is an example of how I organize my dags folder:
I had trouble before when I put the key.json file in the data/ folder and the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the support files in the dags/ folder.
Would the performance of the scheduler be impacted if I put the support files (SQL, keys, schemas) for the DAGs in the dags/ folder? Is there a good use case for the data/ folder?
It would be helpful if you could show me an example of how to structure the Composer folders to support multiple projects with different DAGs, plugins, and support files.
Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate repository per project?
Thanks!
Upvotes: 3
Views: 6013
Reputation: 236
"I had trouble before when I put the key.json file in the data/ folder and the DAGs could not be parsed using the keys in the data/ folder. So now I tend to put all the support files in the dags/ folder."
You only need to set the correct paths to read these files from the data/ or plugins/ folders. The difference between the filesystem when running Airflow in Composer and when running it locally is that the paths to these folders change.
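A minimal sketch of reading a support file from the data/ folder inside a DAG file, assuming the default Composer setup where the environment's bucket is mounted at /home/airflow/gcs on the workers (the constant and helper names here are made up; verify the path for your instance, e.g. with the sys.path tip below):

```python
import os

# Assumption: on Cloud Composer the environment's bucket is mounted at
# /home/airflow/gcs, so the data/ folder is reachable at the path below.
COMPOSER_DATA_DIR = "/home/airflow/gcs/data"


def read_support_file(relative_path):
    """Return the contents of a file stored under the data/ folder."""
    with open(os.path.join(COMPOSER_DATA_DIR, relative_path)) as f:
        return f.read()


# Hypothetical usage at DAG-parse time:
# query = read_support_file("sql/daily_load.sql")
```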
To help with that, in another post I describe a solution for finding the correct path to these folders. I quote my comment from that post:
"If the path I entered does not work for you, you will need to find the path for your Cloud Composer instance. This is not hard to find. In any DAG you could simply log the sys.path variable and see the path printed."
"Would the performance of the scheduler be impacted if I put the support files (SQL, keys, schemas) for the DAGs in the dags/ folder? Is there a good use case for the data/ folder?"
Yes; at a minimum, the scheduler has to check whether each file is a Python script or not. The overhead is small, but it does have an impact.
Once you solve the issue of reading from the data/ or plugins/ folders, you should move these files there.
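If some support files do have to stay under dags/, one standard Airflow feature (not mentioned above) limits the scheduler's work: an .airflowignore file at the root of the dags/ folder makes the scheduler skip matching paths, with each line treated as a regular expression in Airflow 1.x. A sketch with hypothetical folder names:

```
# dags/.airflowignore -- paths matching these patterns are not parsed.
# Folder names are examples; adjust to your layout.
sql/
schemas/
keys/
```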
"Right now, we only have one GitHub repository for the entire Airflow folder. Is it better to have a separate repository per project?"
If your different projects require different PyPI packages, having a separate repository for each one, and separate Airflow environments as well, would be ideal. That way you reduce the risk of running into PyPI package dependency conflicts and you cut build times.
On the other hand, if your projects all use the same PyPI packages, I'd suggest keeping everything in a single repository until it becomes easier to give every project its own repo; having everything in one repository makes deployment easier.
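For reference, each environment can then install its own dependencies from that project's repository; a hedged sketch with placeholder environment and location names:

```
gcloud composer environments update project-a-env \
    --location us-central1 \
    --update-pypi-packages-from-file requirements.txt
```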
Upvotes: 0
Reputation: 431
The impact on the scheduler should be fairly minimal as long as the files you place in the dags/ folder are not .py files; however, you can also place the files in the plugins/ folder, which is also synced via copy.
I would use top-level folders to separate projects (e.g. dags/projectA/dagA.py), or even separate environments if the projects are large enough.
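A sketch of such a layout, with placeholder project and file names:

```
dags/
├── projectA/
│   ├── dagA.py
│   └── sql/
│       └── daily_load.sql
└── projectB/
    ├── dagB.py
    └── schemas/
        └── table_schema.json
```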
Upvotes: 1