Reputation: 14881
We have an Airflow DAG running on an hourly schedule, with tasks updating and overwriting date-partitioned tables in BigQuery.
After making adjustments to the queries and/or schemas of these tables, we want to backfill several days' worth of existing partitions, but to backfill all runs is a huge waste of effort, since each hourly run will just overwrite the same partition 24 times before moving on to the next day.
We can use airflow list_dag_runs to list all runs and filter out the last one for each day, but is there a way to backfill/clear only these last runs per day without rerunning 24 instances for each day?
The airflow clear and airflow backfill commands have options to specify start and end dates, but not specific execution times, so they will cause 24 reruns per date, all performing exactly the same work.
We could use airflow trigger_dag to trigger the DAG manually once per day, but then we would rerun the whole DAG even when only one task out of many needs backfilling.
Upvotes: 0
Views: 1928
Reputation: 7815
You can clear a specific dag run by specifying its execution date as both --start_date and --end_date for the clear command.
For example, for these three dag runs:
$ airflow list_dag_runs a_dag
------------------------------------------------------------------------------------------------------------------------
DAG RUNS
------------------------------------------------------------------------------------------------------------------------
id | run_id | state | execution_date | state_date |
3 | scheduled__2020-06-11T16:20:00+00:00 | success | 2020-06-11T16:20:00+00:00 | 2020-06-11T16:25:00.904470+00:00 |
2 | scheduled__2020-06-11T16:15:00+00:00 | success | 2020-06-11T16:15:00+00:00 | 2020-06-11T16:20:00.503473+00:00 |
1 | scheduled__2020-06-11T16:10:00+00:00 | success | 2020-06-11T16:10:00+00:00 | 2020-06-11T16:17:28.330410+00:00 |
To clear only the dag run with ID 2, execute:
airflow clear --start_date '2020-06-11T16:15:00+00:00' --end_date '2020-06-11T16:15:00+00:00' --no_confirm a_dag
Comparing the state_date values shows that only the dag run with ID 2 was rerun:
$ airflow list_dag_runs a_dag
------------------------------------------------------------------------------------------------------------------------
DAG RUNS
------------------------------------------------------------------------------------------------------------------------
id | run_id | state | execution_date | state_date |
3 | scheduled__2020-06-11T16:20:00+00:00 | success | 2020-06-11T16:20:00+00:00 | 2020-06-11T16:25:00.904470+00:00 |
2 | scheduled__2020-06-11T16:15:00+00:00 | success | 2020-06-11T16:15:00+00:00 | 2020-06-11T16:27:36.242567+00:00 |
1 | scheduled__2020-06-11T16:10:00+00:00 | success | 2020-06-11T16:10:00+00:00 | 2020-06-11T16:17:28.330410+00:00 |
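Applied to the hourly DAG from the question, this can be wrapped in a loop that clears only the last hourly run of each day. A sketch under the assumption that the schedule fires at the top of every hour, so each day's last run has execution time 23:00 UTC (the dates and the DAG name a_dag are placeholders; echo is left in so the generated commands can be reviewed before running them for real):

```shell
# Clear only the 23:00 run of each day to be backfilled.
# Remove "echo" once the printed commands look right.
for day in 2020-06-08 2020-06-09 2020-06-10; do
  echo airflow clear --start_date "${day}T23:00:00+00:00" \
                     --end_date "${day}T23:00:00+00:00" \
                     --no_confirm a_dag
done
```

To address the question's "only one task out of many" concern, the clear command also accepts a task regex filter (-t / --task_regex) to restrict clearing to specific tasks within the dag run.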
Upvotes: 2