Reputation: 85
We would like to use Apache Airflow mostly to schedule Scrapy Python spiders, plus some other scripts. We will have thousands of spiders, and their scheduling can vary from day to day, so we want to be able to create the Airflow DAGs and schedule all of them once a day, automatically from a database. The only examples I have seen for Airflow use Python scripts to write the DAG files.
What is the best way to create the DAG files and do the scheduling automatically?
EDIT: I managed to find a solution that should work, using YAML files: https://codeascraft.com/2018/11/14/boundary-layer%E2%80%89-declarative-airflow-workflows/
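For anyone finding this later, the general idea (independent of boundary-layer, whose actual schema I'm not reproducing here) is a generator file in the dags/ folder that reads the YAML definitions and registers one DAG per spider. A rough sketch only — the config path, the YAML field names, the commands, and the Airflow 1.10-style imports are all assumptions:

```python
# spiders_dag_generator.py -- lives in the Airflow dags/ folder.
# A generic sketch of the YAML-driven approach (NOT boundary-layer's actual
# schema). Each YAML file is assumed to describe one spider, e.g.:
#
#   name: quotes_spider
#   schedule: "0 6 * * *"
#   command: "scrapy crawl quotes"
#
import glob
from datetime import datetime

import yaml
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Config directory is an assumption; point it wherever your YAML files live.
for path in glob.glob("/opt/airflow/spider_configs/*.yaml"):
    with open(path) as f:
        spec = yaml.safe_load(f)

    dag = DAG(
        dag_id=f"spider_{spec['name']}",
        schedule_interval=spec["schedule"],
        start_date=datetime(2019, 1, 1),
        catchup=False,
    )
    BashOperator(
        task_id="run_spider",
        bash_command=spec["command"],
        dag=dag,
    )
    # Airflow discovers DAGs by scanning module-level globals, so each
    # generated DAG must be bound to a unique global name.
    globals()[dag.dag_id] = dag
```

To change which spiders run and when, you then only regenerate the YAML files from the database once a day; the DAG code itself stays fixed.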
Upvotes: 2
Views: 809
Reputation: 10020
Airflow can be used to run thousands of dynamic tasks, but it shouldn't be: Airflow DAGs are supposed to be fairly constant. You can still use Airflow, for example, to process the whole batch of scraped data and use that data in your ETL process later.
A large number of dynamic tasks can lead to cluttered DAG runs, which produces a lot of garbage information in both the GUI and the log files.
But if you really want to use only Airflow, you can read this article (about dynamic DAG generation) and this article (about dynamic task generation inside a DAG).
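For example, dynamic task generation inside a single DAG boils down to building tasks in a loop at parse time. A minimal sketch — the spiders table, its columns, and the spiders_db connection id are hypothetical, and the imports assume Airflow 1.10:

```python
# One DAG that fans out into one task per spider. The spider list is
# pulled from a metadata database every time the scheduler parses this
# file -- which is exactly the kind of overhead and clutter warned
# about above.
from datetime import datetime

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="run_all_spiders",
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
    catchup=False,
)

hook = PostgresHook(postgres_conn_id="spiders_db")  # assumed connection id
rows = hook.get_records("SELECT name FROM spiders WHERE enabled")

for (spider_name,) in rows:
    # Assumes spider names are valid Airflow task ids
    # (alphanumerics, dashes, dots, underscores).
    BashOperator(
        task_id=f"crawl_{spider_name}",
        bash_command=f"scrapy crawl {spider_name}",
        dag=dag,
    )
```

Note that if a spider disappears from the query result, its task (and task history) disappears from the DAG as well, which is one of the reasons this pattern gets messy at the scale you describe.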
Upvotes: 2