Lars Haugseth

Reputation: 14881

Clear and backfill specific instances for DAGs scheduled to run hourly?

We have an Airflow DAG running on an hourly schedule, with tasks updating and overwriting date-partitioned tables in BigQuery.

After making adjustments to the queries and/or schemas of these tables, we want to backfill several days' worth of existing partitions. Backfilling every run is a huge waste of effort, though, since the hourly runs will just overwrite the same partition 24 times before moving on to the next day.

We can use airflow list_dag_runs to list all runs and pick out the last one for each day, but is there a way to backfill/clear only these last runs per day without rerunning 24 instances every day?

The airflow clear and airflow backfill commands have options to specify start and end dates, but not specific instance times, so they will cause 24 reruns per date, all of which perform the exact same work.

We could use airflow trigger_dag to trigger the DAG manually once per day, but then we will rerun the whole DAG even when there's only one task out of many that we need to backfill for.

Upvotes: 0

Views: 1928

Answers (1)

SergiyKolesnikov

Reputation: 7815

You can clear a specific DAG run by passing its execution date as both --start_date and --end_date to the clear command.

For example, for these three dag runs:

$ airflow list_dag_runs a_dag

------------------------------------------------------------------------------------------------------------------------
DAG RUNS
------------------------------------------------------------------------------------------------------------------------
id  | run_id               | state      | execution_date       | state_date           |

3   | scheduled__2020-06-11T16:20:00+00:00 | success    | 2020-06-11T16:20:00+00:00 | 2020-06-11T16:25:00.904470+00:00 |
2   | scheduled__2020-06-11T16:15:00+00:00 | success    | 2020-06-11T16:15:00+00:00 | 2020-06-11T16:20:00.503473+00:00 |
1   | scheduled__2020-06-11T16:10:00+00:00 | success    | 2020-06-11T16:10:00+00:00 | 2020-06-11T16:17:28.330410+00:00 |

To clear only the dag run with ID 2, execute:

$ airflow clear --start_date '2020-06-11T16:15:00+00:00' --end_date '2020-06-11T16:15:00+00:00' --no_confirm a_dag

Comparing state_date for the DAG runs before and after shows that only the DAG run with ID 2 was rerun:

$ airflow list_dag_runs a_dag

------------------------------------------------------------------------------------------------------------------------
DAG RUNS
------------------------------------------------------------------------------------------------------------------------
id  | run_id               | state      | execution_date       | state_date           |

3   | scheduled__2020-06-11T16:20:00+00:00 | success    | 2020-06-11T16:20:00+00:00 | 2020-06-11T16:25:00.904470+00:00 |
2   | scheduled__2020-06-11T16:15:00+00:00 | success    | 2020-06-11T16:15:00+00:00 | 2020-06-11T16:27:36.242567+00:00 |
1   | scheduled__2020-06-11T16:10:00+00:00 | success    | 2020-06-11T16:10:00+00:00 | 2020-06-11T16:17:28.330410+00:00 |
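
To apply this to several days at once, the same command can be generated once per day for that day's last run. The following is only a rough sketch, not a tested recipe: it assumes you have already collected the execution dates (for example from the list_dag_runs output) as ISO 8601 strings, and that your Airflow version's clear command accepts --task_regex to restrict clearing to matching tasks. The helper names and the task name load_partition are hypothetical.

```python
from datetime import datetime
import shlex


def last_run_per_day(execution_dates):
    """Return the latest execution date of each calendar day, oldest day first.

    Dates are ISO 8601 strings; the calendar day is taken in each
    timestamp's own UTC offset.
    """
    latest = {}
    for text in execution_dates:
        ts = datetime.fromisoformat(text)
        day = ts.date()
        if day not in latest or ts > latest[day]:
            latest[day] = ts
    return [latest[day].isoformat() for day in sorted(latest)]


def clear_commands(dag_id, execution_dates, task_regex=None):
    """Build one `airflow clear` argument list per day, for that day's last run."""
    commands = []
    for execution_date in last_run_per_day(execution_dates):
        # Same start and end date, so exactly one DAG run is selected.
        args = [
            "airflow", "clear",
            "--start_date", execution_date,
            "--end_date", execution_date,
            "--no_confirm",
        ]
        if task_regex is not None:
            # Assumption: --task_regex is supported by your Airflow version.
            args += ["--task_regex", task_regex]
        args.append(dag_id)
        commands.append(args)
    return commands


if __name__ == "__main__":
    # Made-up execution dates for illustration only.
    dates = [
        "2020-06-10T22:00:00+00:00",
        "2020-06-10T23:00:00+00:00",
        "2020-06-11T22:00:00+00:00",
        "2020-06-11T23:00:00+00:00",
    ]
    # "load_partition" is a hypothetical task name.
    for args in clear_commands("a_dag", dates, task_regex="^load_partition$"):
        print(" ".join(shlex.quote(arg) for arg in args))
```

The script only prints the commands, so you can review them before running anything. Each argument list could also be passed to subprocess.run if you prefer to execute them directly.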

Upvotes: 2
