tkarahan

Reputation: 335

Airflow Execution Date is Confusing

I'm studying the Airflow documentation to better understand its scheduler mechanism, and I came across the example below.

The docs state that when the DAG is picked up by the scheduler on 2016-01-02 at 6 AM, a single DAG Run will be created with an execution_date of 2016-01-01, and the next one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02.

The schedule interval is given as hourly, and the execution date refers to the start of the period at whose end the DAG runs. So why isn't the execution date just one hour before 2016-01-02 6 AM, the time at which the scheduler picks up the DAG?

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@hourly',
}

dag = DAG('tutorial', catchup=False, default_args=default_args)

I created a basic DAG; its run info is in the picture below. I set schedule_interval to 50 * * * *. When the scheduler picked up the DAG, the clock was at about 10:58, so 10:50 had already passed. The DAG was triggered immediately, and because 10:50 had already passed, its execution date was set to 2021-04-25 09:50. So its execution date falls on the same day it was triggered, because it is scheduled at minute 50 of each hour.

In Airflow, @hourly corresponds to 0 * * * *, so its schedule is similar: it is triggered at minute 0 of each hour. But in the docs its execution date is given as 2016-01-01. I think it should have been 2016-01-02 5 AM, because the DAG is triggered every hour, and when it is triggered at 6 AM, the start of its interval is 2016-01-02 5 AM.

[image: dag run]

Upvotes: 8

Views: 18338

Answers (1)

Elad Kalif

Reputation: 15931

Airflow runs DAGs at the end of the interval. Thus, when you work with a 24-hour interval, the run for 2016-01-01 will start on 2016-01-02. This is consistent with how data pipelines are authored: today you are processing yesterday's data.
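To make the end-of-interval rule concrete, here is a minimal sketch in plain Python. It does not use Airflow: `latest_run` and its `anchor` parameter are hypothetical names, and it assumes a fixed `timedelta` interval and `catchup=False`, so only the most recent completed interval is considered. It reproduces the 2016-01-01 execution date from the docs only when the interval is one day. One possible explanation, stated here as an assumption, is that `schedule_interval` placed inside `default_args` is not picked up by the `DAG` constructor, so the DAG falls back to a daily default.

```python
from datetime import datetime, timedelta


def latest_run(now, anchor, interval):
    """Return (execution_date, run_start) for the latest completed interval.

    `anchor` is any datetime aligned with the schedule (a stand-in for the
    DAG's start_date). This is an illustration, not Airflow's scheduler code.
    """
    # Whole intervals that have fully elapsed between anchor and now.
    completed = (now - anchor) // interval
    # The run is triggered at the END of the latest completed interval...
    run_start = anchor + completed * interval
    # ...but it is labelled with the START of that interval.
    execution_date = run_start - interval
    return execution_date, run_start


# Docs example with a one-day interval, scheduler picks the DAG up at 6 AM.
daily = latest_run(datetime(2016, 1, 2, 6, 0), datetime(2015, 12, 1), timedelta(days=1))
print(daily[0])  # 2016-01-01 00:00:00

# Same moment with a truly hourly interval: what the question expected.
hourly = latest_run(datetime(2016, 1, 2, 6, 0), datetime(2015, 12, 1), timedelta(hours=1))
print(hourly[0])  # 2016-01-02 05:00:00

# The question's own DAG: minute 50 of each hour, picked up at about 10:58.
at_50 = latest_run(datetime(2021, 4, 25, 10, 58), datetime(2021, 4, 25, 0, 50), timedelta(hours=1))
print(at_50[0])  # 2021-04-25 09:50:00
```

In all three cases the run starts at the end of the interval and is labelled with the interval's start. Only the interval length decides whether the execution date lands on the previous day or the previous hour.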

You can read more about it in the following answers:

https://stackoverflow.com/a/65196624/14624409

https://stackoverflow.com/a/66288641/14624409

Upvotes: 8
