MassyB

Reputation: 1184

How is the execution_date of a DagRun set?

Given a DAG with a start_date that is run at a specific date, how is the execution_date of the corresponding DagRun defined?

I have read the documentation but one example is confusing me:

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'schedule_interval': '@hourly',
}

dag = DAG('tutorial', catchup=False, default_args=default_args)

Assuming that the DAG is run on 2016-01-02 at 6 AM, the first DAGRun will have an execution_date of 2016-01-01 and, as said in the documentation

the next one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02

Here is how I would have set the execution_date:

the DAG has its schedule_interval set to every hour and is run on 2016-01-02 at 6 AM, so the execution_date of the first DagRun would have been set to 2016-01-02 at 7 AM, the second to 2016-01-02 at 8 AM, etc.
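To make the contrast with Airflow's actual behaviour concrete, here is a minimal sketch in plain Python (hourly_runs is a hypothetical helper for illustration, not an Airflow API):

```python
from datetime import datetime, timedelta

def hourly_runs(start, until):
    """Toy model of Airflow's rule: each DagRun covers one
    schedule interval; its execution_date is the START of the
    interval, and the run only fires at the interval's END."""
    runs = []
    execution_date = start
    while execution_date + timedelta(hours=1) <= until:
        fires_at = execution_date + timedelta(hours=1)
        runs.append((execution_date, fires_at))
        execution_date += timedelta(hours=1)
    return runs

# The hourly interval starting 2016-01-02 06:00 has an
# execution_date of 06:00 but only fires at 07:00 -- the
# reverse of the intuition described in the question.
```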

Upvotes: 2

Views: 6746

Answers (2)

Dharam

Reputation: 153

We get this question a lot from analysts writing Airflow DAGs.

Each dag run covers a period of time with a start & end.

The start = execution_date

The end = when the dag run is created and executed (next_execution_date)

An example that should help:

Schedule interval: '0 0 * * *' (run daily at 00:00:00 UTC)
Start date: 2019-10-01 00:00:00

10/1 00:00           10/2 00:00
*<------------------>*
 < your 1st dag run >
^ execution_date
  next_execution_date^
                     ^when this 1st dag run is actually created by the scheduler
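The timeline above can be sketched in plain Python (a toy model, not Airflow code; the variable names mirror Airflow's template variables):

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)       # '0 0 * * *', daily at midnight
execution_date = datetime(2019, 10, 1)      # START of the period covered
next_execution_date = execution_date + schedule_interval

# The scheduler only creates the run once the period has ended,
# i.e. at (or shortly after) next_execution_date.
run_created_at = next_execution_date
```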

As @simond pointed out in a comment, "execution_date" is a poor name for this variable. It is neither a plain date nor the time the run was executed. Alas, we're stuck with what the creators of Airflow gave us... I find it helpful to just use next_execution_date if I want the datetime at which the dag run will execute my code.

Upvotes: 3

Simon D

Reputation: 6279

This is just how scheduling works in Airflow. I think it makes sense to do it the way that Airflow does when you think about how normal ETL batch processes run and how you use the execution_date to pick up delta records that have changed.

Let's say that we want to schedule a batch job to run every night to extract new records from some source database. We want all records that were changed from 1/1/2018 onwards (including records changed on the 1st). To do this you would set the start_date of the DAG to 1/1/2018; the scheduler will run a bunch of times, but when it gets to 2/1/2018 (or very shortly after) it will run our DAG with an execution_date of 1/1/2018.

Now we can send a SQL statement to the source database that uses the execution_date as part of the SQL via Jinja templating. The SQL would look something like:

SELECT row1, row2, row3
FROM table_name
WHERE timestamp_col >= '{{ execution_date }}' AND timestamp_col < '{{ next_execution_date }}'
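For illustration, this is roughly what the query looks like once the template variables are filled in (a sketch using plain string formatting; Airflow itself renders the {{ ... }} placeholders with Jinja):

```python
from datetime import datetime, timedelta

# Values Airflow would supply for the 1/1/2018 run:
execution_date = datetime(2018, 1, 1)
next_execution_date = execution_date + timedelta(days=1)

# The rendered query selects exactly the records changed on 1/1/2018.
sql = (
    "SELECT row1, row2, row3 "
    "FROM table_name "
    f"WHERE timestamp_col >= '{execution_date}' "
    f"AND timestamp_col < '{next_execution_date}'"
)
```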

I think when you look at it this way it makes more sense although I admit I had trouble trying to understand this at the beginning.

Here is a quote from the documentation https://airflow.apache.org/scheduler.html:

The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.

Also it's worth noting that the example you're looking at in the documentation describes the behaviour of the scheduler when backfilling is disabled. If backfilling were enabled, there would be a DAG run created for every 1-hour interval between 1/12/2015 and the current date, assuming the DAG had never been run before.
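As a rough sketch of how many runs such a backfill would create (plain Python arithmetic, not an Airflow API; the "current" time is an assumed example value):

```python
from datetime import datetime, timedelta

# With catchup enabled, the scheduler creates one DagRun per
# missed schedule interval between start_date and "now".
start_date = datetime(2015, 12, 1)
now = datetime(2016, 1, 2, 6)        # assumed current time for illustration
interval = timedelta(hours=1)        # '@hourly'

missed_runs = int((now - start_date) / interval)
# 32 days * 24 hours + 6 hours = 774 hourly DagRuns to backfill
```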

Upvotes: 4
