Reputation: 9637
I'm trying to use Python's Airflow library. I want it to scrape a web page periodically.
The issue I'm having is that if my start_date is several days in the past, then when I start the scheduler it will backfill from the start_date to today. For example:
Assume today is the 20th of the month.
Assume the start_date is the 15th of this month.
If I start the scheduler on the 20th, it will scrape the page 5 times on the 20th. It will see that a DAG instance was supposed to run on the 15th, and will run that instance (the one for the 15th) on the 20th. Then it will run the instance for the 16th on the 20th, and so on.
In short, Airflow will try to "catch up", but this doesn't make sense for web scraping.
Is there any way to make Airflow consider a DAG instance failed after a certain time?
Upvotes: 2
Views: 4555
Reputation: 1024
Airflow was designed with backfilling in mind, so what you're asking for runs against its core logic.
For now you can update the start_date for this specific task or for the whole DAG.
Every operator has a start_date: http://pythonhosted.org/airflow/code.html#baseoperator
The scheduler is not designed to be stopped and restarted. If you start it today, setting your task's start_date to today seems logical to me.
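A minimal sketch of that advice: compute a start_date at today's midnight so there are no past intervals to catch up on when the scheduler starts. The helper name `recent_start_date` is made up for illustration; only the `datetime` logic is shown, and the DAG wiring in the comment is approximate.

```python
from datetime import datetime, timedelta

def recent_start_date(days_back=0):
    """Return midnight of `days_back` days ago.

    With days_back=0, the start_date is today's midnight, so a
    scheduler started today has no earlier DAG runs to backfill.
    """
    midnight = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=days_back)

# In a DAG definition this would be used roughly as (names hypothetical):
# dag = DAG('scrape_page',
#           start_date=recent_start_date(),
#           schedule_interval='@daily')
```

Note that hard-coding a fixed date in the past has the same backfill problem every time you redeploy, which is why computing it relative to "now" is sometimes suggested, though a dynamic start_date has its own pitfalls with Airflow's scheduling semantics.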
Upvotes: 1
Reputation: 91
This feature is in the roadmap for Airflow, but does not currently exist.
See: Issue #1155
You may be able to hack together a solution using BranchPythonOperator. As the documentation says, make sure you have set depends_on_past=False
(this is the default). I don't have Airflow set up, so I can't test and provide example code at this time.
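The branch hack could be sketched like this: a branch callable inspects the scheduled execution date and routes stale runs to a no-op task instead of the scraper. The task ids and the `max_age` cutoff are invented for illustration; only the pure-Python decision function is shown, with the Airflow wiring left as a comment.

```python
from datetime import datetime, timedelta

# Hypothetical task ids for the two branch targets.
SCRAPE_TASK = 'scrape_page'
SKIP_TASK = 'skip_stale_run'

def choose_branch(execution_date, max_age=timedelta(days=1)):
    """Branch callable: skip runs whose scheduled date is too old.

    If the run being scheduled is older than `max_age`, return the id
    of a do-nothing task so catch-up runs don't re-scrape the page.
    """
    if datetime.now() - execution_date > max_age:
        return SKIP_TASK
    return SCRAPE_TASK

# With Airflow this would be wired up roughly as:
# branch = BranchPythonOperator(task_id='check_age',
#                               python_callable=choose_branch,
#                               dag=dag)
```

The stale runs still execute, but they take the cheap branch, which approximates "consider the DAG instance failed (or skipped) after a certain time".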
Upvotes: 4