Reputation: 27
I don't understand how to interpret the combination of schedule_interval=None and start_date=airflow.utils.dates.days_ago(3) in an Airflow DAG. If the schedule_interval were '@daily', then (I think) the following DAG would wait for the start of the next day and then run three times, once for each day, backfilling the days_ago(3). I do know that because schedule_interval=None, it will have to be started manually, but I don't understand the behavior beyond that. What is the point of the days_ago(3)?
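import airflow.utils.dates
from airflow import DAG

dag = DAG(
    dag_id="chapter9_aws_handwritten_digit_classifier",
    schedule_interval=None,  # no automatic schedule; runs must be triggered manually
    start_date=airflow.utils.dates.days_ago(3),
)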
The example is from https://github.com/BasPH/data-pipelines-with-apache-airflow/blob/master/chapter07/digit_classifier/dags/chapter9_digit_classifier.py
Upvotes: 1
Views: 5111
Reputation: 16099
Your confusion is understandable. This is also confusing for the Airflow scheduler, which is why using dynamic values for start_date is considered bad practice. To quote from the Airflow FAQ:
We recommend against using dynamic values as start_date
The reason is that Airflow calculates DAG scheduling using start_date as the base and schedule_interval as the period. When the end of the period is reached, the DAG is triggered. However, when the start_date is dynamic, there is a risk that the period will never end because the base is always "moving".
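The "moving base" problem can be illustrated with a small sketch in plain Python. This is a simplified model of the base-plus-period rule described above, not Airflow's actual scheduler code; the helper names first_trigger_time and is_due are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timedelta, timezone


def first_trigger_time(start_date: datetime, interval: timedelta) -> datetime:
    """Simplified model: the first run fires when the first period ends."""
    return start_date + interval


def is_due(start_date: datetime, interval: timedelta, now: datetime) -> bool:
    """True once the period that begins at start_date has fully elapsed."""
    return now >= first_trigger_time(start_date, interval)


interval = timedelta(days=1)

# Static base: fixed once, so the period eventually ends.
static_start = datetime(2022, 1, 1, tzinfo=timezone.utc)

# Simulated wall clock, checked every 12 hours.
clock = datetime(2022, 1, 1, 12, tzinfo=timezone.utc)

static_due = []
dynamic_due = []
for _ in range(3):
    static_due.append(is_due(static_start, interval, clock))
    # Dynamic base: start_date is re-evaluated as "now" on every check,
    # so the end of the period is always one interval in the future.
    dynamic_due.append(is_due(clock, interval, clock))
    clock += timedelta(hours=12)

print(static_due)
print(dynamic_due)
```

In this model the static base becomes due once a full day has passed, while the dynamic base never does, because each check pushes the end of the period forward again.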
To ease your confusion, just change the start_date to some static value, and then it will make sense to you.
Note also that the guide you referred to was written before AIP-39 Richer scheduler_interval was implemented. Starting with Airflow 2.2.0, it's much easier to schedule DAGs. You can read about Timetables in the documentation.
Upvotes: 1