Reputation: 1063
How can I schedule a DAG to have a weekday execution date but a start date the following day, which is not necessarily a weekday?
My rationale is that I get data at the end of each business day, which I would like to process early the next morning. The Airflow "Common Pitfalls" page describes the execution date as the date the data belongs to, while the start date is the date you run your ETL.
For example, I want a series of DAG runs to have the following execution and start dates:
DAG start_date | Task Started    | Task execution_date
2018-01-01     | 2018-01-02 Tues | 2018-01-01 Mon
               | 2018-01-03 Wed  | 2018-01-02 Tues
               | 2018-01-04 Thur | 2018-01-03 Wed
               | 2018-01-05 Fri  | 2018-01-04 Thur
               | 2018-01-06 Sat  | 2018-01-05 Fri
               | 2018-01-09 Tues | 2018-01-08 Mon
The closest I have managed to get is with the schedule 0 2 * * TUE-SAT, which gives the wrong execution date (Saturday) for the run started on a Tuesday (see below):
DAG start_date | Task Started    | Task execution_date
2018-01-01     | 2018-01-03 Wed  | 2018-01-02 Tues
               | 2018-01-04 Thur | 2018-01-03 Wed
               | 2018-01-05 Fri  | 2018-01-04 Thur
               | 2018-01-06 Sat  | 2018-01-05 Fri
               | 2018-01-09 Tues | 2018-01-06 Sat
or with the schedule 0 2 * * MON-FRI, which does not run Friday's DAG run until Monday, and I need the results over the weekend:
DAG start_date | Task Started    | Task execution_date
2018-01-01     | 2018-01-02 Tues | 2018-01-01 Mon
               | 2018-01-03 Wed  | 2018-01-02 Tues
               | 2018-01-04 Thur | 2018-01-03 Wed
               | 2018-01-05 Fri  | 2018-01-04 Thur
               | 2018-01-08 Mon  | 2018-01-05 Fri
               | 2018-01-09 Tues | 2018-01-08 Mon
Upvotes: 4
Views: 5300
Reputation: 6831
First, quoting the Airflow docs:
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be trigger soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let’s Repeat That: The scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
So what's happening here?
Specifying 0 2 * * MON-FRI
means that your periods are:
MON 2AM -> TUE 2AM
TUE 2AM -> WED 2AM
WED 2AM -> THU 2AM
THU 2AM -> FRI 2AM
FRI 2AM -> MON 2AM <- the problem
This means that the time at which you want each run to start defines the end of a period, while the data partition you want corresponds to the start of that period.
Long story short: it's impossible to specify a periodic division of the week such that every period starts on a weekday and ends the following day. Why? Because there would be no period to represent what happens over the weekend.
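To make the period boundaries concrete, here is a minimal sketch in plain Python (no Airflow import) that reproduces the MON-FRI periods listed above for the week of 2018-01-01. The helper name `schedule_periods` and its parameters are hypothetical and only mimic a daily 2AM cron restricted to certain weekdays; it is not how the scheduler itself is implemented.

```python
from datetime import datetime, timedelta

def schedule_periods(start, weekdays, hour=2, count=5):
    """Return (period_start, period_end) pairs for a daily cron limited to `weekdays`.

    `weekdays` uses datetime.weekday() numbering: Monday=0 ... Sunday=6.
    period_start plays the role of execution_date; period_end is when the run starts.
    """
    ticks = []
    t = start.replace(hour=hour, minute=0, second=0, microsecond=0)
    if t < start:
        t += timedelta(days=1)
    # Collect count + 1 cron ticks so that they form `count` consecutive periods.
    while len(ticks) < count + 1:
        if t.weekday() in weekdays:
            ticks.append(t)
        t += timedelta(days=1)
    return list(zip(ticks, ticks[1:]))

MON_FRI = {0, 1, 2, 3, 4}
periods = schedule_periods(datetime(2018, 1, 1), MON_FRI)

for execution_date, run_time in periods:
    print(f"{execution_date:%a %Y-%m-%d} -> run starts {run_time:%a %Y-%m-%d %H:%M}")
```

The last pair is the Friday period, which only closes at 2AM on the following Monday. That is why Friday's run is delayed past the weekend.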
How can you make a periodical division that works?
Use 0 2 * * TUE-SAT, but don't trust the execution_date to mark exactly where your next data to be processed starts; treat it as the point up to which your past data is deemed already processed.
Upvotes: 8
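One way to act on the TUE-SAT advice is to derive the business day being processed from the time the run starts rather than from the execution_date. The sketch below is plain Python under stated assumptions: `next_tick` and `data_date` are hypothetical helpers, and the rule "data date = run start minus one day" is taken from the question's requirement that data is processed the morning after it arrives. Airflow versions that expose a `next_execution_date` template variable could probably supply the run start directly, but check that against the version you run.

```python
from datetime import datetime, timedelta

TUE_SAT = {1, 2, 3, 4, 5}  # datetime.weekday() numbering: Monday=0 ... Sunday=6

def next_tick(after, weekdays=TUE_SAT, hour=2):
    """Return the first 2AM tick on an allowed weekday strictly after `after`."""
    t = after.replace(hour=hour, minute=0, second=0, microsecond=0)
    if t <= after:
        t += timedelta(days=1)
    while t.weekday() not in weekdays:
        t += timedelta(days=1)
    return t

def data_date(execution_date):
    """Business day whose data a run should process: the day before the run starts."""
    run_start = next_tick(execution_date)
    return (run_start - timedelta(days=1)).date()

# The run stamped Saturday 2018-01-06 starts on Tuesday 2018-01-09
# and should therefore process Monday's data.
saturday_stamp = datetime(2018, 1, 6, 2)
print(next_tick(saturday_stamp), data_date(saturday_stamp))
```

For runs stamped Tuesday through Friday the derived date equals the execution_date, so only the Saturday-stamped run is remapped (to Monday).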