yrom1

Reputation: 27

Meaning of `schedule_interval=None` and `start_date=airflow.utils.dates.days_ago(n)` in an Airflow DAG?

I don't understand how to interpret the combination of `schedule_interval=None` and `start_date=airflow.utils.dates.days_ago(3)` in an Airflow DAG. If the `schedule_interval` were `'@daily'`, then (I think) the following DAG would wait for the start of the next day and then run three times, once per day, backfilling from `days_ago(3)`. I do know that because `schedule_interval=None`, it has to be started manually, but I don't understand the behavior beyond that. What is the point of the `days_ago(3)`?

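import airflow.utils.dates
from airflow import DAG

dag = DAG(
    dag_id="chapter9_aws_handwritten_digit_classifier",
    schedule_interval=None,  # no schedule: the DAG runs only when triggered manually
    start_date=airflow.utils.dates.days_ago(3),
)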

The example is from https://github.com/BasPH/data-pipelines-with-apache-airflow/blob/master/chapter07/digit_classifier/dags/chapter9_digit_classifier.py

Upvotes: 1

Views: 5111

Answers (1)

Elad Kalif

Reputation: 16099

Your confusion is understandable. This is also confusing for the Airflow scheduler, which is why using dynamic values for `start_date` is considered bad practice. To quote from the Airflow FAQ:

We recommend against using dynamic values as start_date

The reason is that Airflow calculates DAG scheduling using `start_date` as the base and `schedule_interval` as the period. When the end of the period is reached, the DAG is triggered. However, when the `start_date` is dynamic, there is a risk that the period will never end because the base is always "moving".
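To make the "moving base" idea concrete, here is a rough sketch in plain Python. It is a simplified model and not Airflow's actual scheduler code: the helper names `first_trigger_time` and `is_first_run_due` are made up for illustration, and the dynamic case assumes a `start_date` that is recomputed as the current time on every evaluation (the way `datetime.now()` would behave).

```python
from datetime import datetime, timedelta


def first_trigger_time(start_date: datetime, interval: timedelta) -> datetime:
    """Hypothetical helper: the first run fires at the END of the first period."""
    return start_date + interval


def is_first_run_due(start_date: datetime, interval: timedelta, now: datetime) -> bool:
    """Hypothetical helper: True once one full period after start_date has elapsed."""
    return now >= first_trigger_time(start_date, interval)


interval = timedelta(days=1)
check_times = (datetime(2021, 1, 1, 12), datetime(2021, 1, 2), datetime(2021, 1, 3))

# Static start_date: the base is fixed, so the first period eventually ends.
static_start = datetime(2021, 1, 1)
static_due = [is_first_run_due(static_start, interval, now) for now in check_times]

# Dynamic start_date (assumed to be re-evaluated as "now" on every check):
# the base moves with the clock, so the end of the period is never reached.
dynamic_due = [is_first_run_due(now, interval, now) for now in check_times]

print(static_due)   # [False, True, True]
print(dynamic_due)  # [False, False, False]
```

With the fixed base, the run becomes due as soon as the first day has passed. With the moving base, the end of the period always stays one interval ahead of the clock.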

To ease your confusion, just change the `start_date` to some static value, and then it will make sense to you.
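As a sketch of what that could look like for the DAG in the question, assuming an Airflow 2.x release that still accepts the `schedule_interval` argument (newer releases favor a `schedule` argument instead). The date `2021-01-01` is an arbitrary placeholder, not a value taken from the original example.

```python
from datetime import datetime

from airflow import DAG

# Static start_date: the same value every time this file is parsed.
# 2021-01-01 is an arbitrary placeholder chosen for illustration.
dag = DAG(
    dag_id="chapter9_aws_handwritten_digit_classifier",
    schedule_interval=None,  # no schedule: the DAG runs only when triggered manually
    start_date=datetime(2021, 1, 1),
)
```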

Note also that the guide you referred to was written before AIP-39 Richer scheduler_interval was implemented. Starting with Airflow 2.2.0, it's much easier to schedule DAGs. You can read about Timetables in the documentation.

Upvotes: 1
