Reputation: 5029

making Airflow behave like Luigi: how to prevent tasks to be re-run in future runs of a DAG if their output was necessary to be obtained only once?

I come from experiences with Luigi, where if a file was produced successfully by a task and the task was also unmodified, then re-runs of the DAG would not re-run that task, but would reuse its previously-obtained output.

Is there any way to obtain the same behavior with AirFlow?

Currently, if I re-run the dag, it re-executes all the tasks, no matter if they produced a successful (and unchanged) output in the past. So, basically I need a task to be marked as successful if its code was unchanged.

Upvotes: 2

Answers (1)

Jarek Potiuk

Reputation: 20077

This is the crucial and important feature of Airflow to have all the tasks as idempotent. This means that re-running a task on the same input should generally override the output with newly processed version of that data - so that task depending on it can be automatically rerun. But the data might be different after reprocessing than it was originally.

That's why in Airflow you have a backfill command that basically means.

Please re-run this DAG for selected past runs (say last week worth of runs) - but you should JUST reprocess starting from task X (which will re-run task X and ALL tasks that depend on its output).

This also means that when you want to re-run parts of past DAGs but you know that you want to relay on existing output of certain tasks there - you only backfill the tasks that are depending on the output of that task (but not the task itself).

This allows for much more flexibility by defining which tasks in past DAG runs should be re-run (you basically invalidate outputs of certain tasks by making them target of backfill).

This covers more than the case you mention:

a) if you want to not change an output of certain task - you do not backfill that task - but the task(s) that follow from it

b) more importantly - if you want to re-process the task even in the task input and task itself were modified, you can still do it - by backfilling that task.

The case b) is often important, because some of the tasks might have implicit dependencies that change - even if the inputs and task did not change, processing it again might produce different (often better) result.

A good example that I've heard is re-processing call records by telecom operators where you had to determine phone models from IMEI of the phones. In this case you might have a single service that does the mapping, but it might get updated to a newer version when manufacturers refresh their model database - when new phones are introduced, the refresh will happen with some delays, so reprocessing regularly last week's of data might give different results even if the input ("list of calls") and task ("execute map IMEIS to phone models") did not change from the DAG's Python point of view.

Airflow almost always calls external services to run certain tasks, and those services themselves might improve over time - this means that limiting re-processing to the cases where "no input + no task code" has changed is very limiting (but you can still deliberately decide on it by choosing the backfill scope - i.e. which tasks to reprocess).

Upvotes: 1

making Airflow behave like Luigi: how to prevent tasks to be re-run in future runs of a DAG if their output was necessary to be obtained only once?

Answers (1)

Related Questions