vrivesmolina

Reputation: 41

How to run Airflow on a specific day_of_month at a certain time?

I am trying to run an Airflow DAG on the 2nd of every month at 11:00 am, but I am failing to do so. My settings are:

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': today_date,
    'email': ['mymail'],
    'email_on_failure': True,
    'email_on_retry': True,
    'retries': 1,
    'retry_delay': timedelta(minutes=7),
}

dag = DAG('my_dag', default_args=default_args, schedule_interval='00 11 02 * *')

Airflow works flawlessly when I run a DAG on a daily basis:

schedule_interval='00 11 * * *'

but I don't seem to be able to make it work on a monthly basis :(

thanks!

Upvotes: 4

Views: 12680

Answers (3)

Fedor

Reputation: 86

I was having a similar issue and discovered that my understanding of Airflow's schedule_interval as an equivalent of a cron job was wrong! It has the same format, yes, but Airflow triggers tasks differently than cron does:

Airflow treats every run as a data interval, which starts at the scheduled time and ends just before the next run. But the actual execution of the DAG run happens at the data interval's end! For example: your DAG, scheduled for 11:00 on the 2nd, i.e. having execution_date=dt.datetime(YYYY, MM, 2, 11, 0, 0), will actually be triggered one month later, once that data interval has ended, at the end of your month.

(Just wait patiently and you'll see your code working in a month.
Update: as mentioned in the other answers, since your today_date is dynamic, the schedule shifts one month ahead on every parse of your DAG script, so it never actually runs! You need to use a constant start_date.)

DAG Runs - Data Interval

"All dates in Airflow are tied to the data interval concept in some way. The “logical date” (also called execution_date in Airflow versions prior to 2.2) of a DAG run, for example, denotes the start of the data interval, not when the DAG is actually executed.

Similarly, since the start_date argument for the DAG and its tasks points to the same logical date, it marks the start of the DAG’s first data interval, not when tasks in the DAG will start running. In other words, a DAG run will only be scheduled one interval after start_date."

See also: Airflow Start_Date And Execution_Date Explained

So, if you want to run your DAG "today", you need to specify 'start_date': month_ago_date, and if you use the execution_date parameter, keep in mind it is equal to data_interval_start, not data_interval_end, which is when the task actually runs...
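This timing rule can be sketched stand-alone with plain datetime (no Airflow import; the concrete dates below are hypothetical examples, not from the question):

```python
from datetime import datetime

# Hypothetical dates illustrating the data-interval rule for
# schedule_interval='0 11 2 * *' (sketch only, no Airflow involved).
data_interval_start = datetime(2023, 1, 2, 11, 0)  # the run's execution_date
data_interval_end = datetime(2023, 2, 2, 11, 0)    # the next cron fire

# Airflow queues the run only once the interval has fully elapsed,
# i.e. roughly at data_interval_end, one month after execution_date.
actual_trigger_time = data_interval_end
```

So the run labeled with January's execution_date is only triggered in February, which is exactly the one-month delay described above.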

Upvotes: 1

Saud Bin Habib

Reputation: 51

If you want to run your DAG on the 2nd of every month at 11:00 am, you can use this code:

schedule_interval = '0 11 2 * *'

dag_name = DAG(
    'DAG_ID',
    default_args=default_args,
    schedule_interval=schedule_interval,
)

In the schedule interval, 0 refers to the minute, 11 to the hour, 2 to the day of month, the first * to any month, and the last * to any day of week.

For more scheduling information, check this website: https://crontab.guru/#0_11_2__
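As a rough illustration of those cron fields, here is a hand-rolled stdlib-only sketch (a real scheduler would use a cron parser) that computes when '0 11 2 * *' would next fire:

```python
from datetime import datetime

def next_monthly_fire(now, day=2, hour=11, minute=0):
    """Next time a '0 11 2 * *'-style cron expression would fire after `now`.

    Hand-rolled sketch for illustration only; it handles a fixed
    day-of-month schedule, not general cron expressions.
    """
    candidate = now.replace(day=day, hour=hour, minute=minute,
                            second=0, microsecond=0)
    if candidate <= now:
        # This month's fire time has passed: roll over to next month.
        if now.month == 12:
            candidate = candidate.replace(year=now.year + 1, month=1)
        else:
            candidate = candidate.replace(month=now.month + 1)
    return candidate
```

For example, at 09:30 on March 15 the next fire is April 2 at 11:00; remember, though, that Airflow triggers the run one interval after that logical date, as the other answer explains.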

Upvotes: 3

s7anley

Reputation: 2498

In the comments you mention that you use datetime.today() for start_date, and that is exactly what's causing the problem. A job instance is started once the period it covers has ended, but in your case that will never happen. Try adjusting start_date to something like:

from datetime import date
from dateutil.relativedelta import relativedelta
start_date = date.today() + relativedelta(months=-1)

I suggest re-reading the Scheduling & Triggers section in the documentation. It also took me a couple of tries to understand how to correctly schedule DAGs.
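The effect can be sketched without Airflow (a fixed 30-day timedelta stands in for one calendar month here, purely for illustration):

```python
from datetime import datetime, timedelta

# Sketch: why a moving start_date never fires. A 30-day interval
# stands in for one calendar month (illustration only).
INTERVAL = timedelta(days=30)

def first_run_due(start_date, now):
    # Airflow schedules the first run one full interval after start_date,
    # so the run becomes due only once that interval has fully elapsed.
    return start_date + INTERVAL <= now

# Backdated start_date: the first interval has already elapsed -> run fires.
# start_date = today on every parse: the interval never elapses -> no run.
```

With datetime.today() as start_date, every re-parse of the DAG file moves the goalposts forward, so the "one interval after start_date" condition is never satisfied.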

Upvotes: 1
