Reputation: 2169
I'm trying to see whether or not there is a straightforward way to not start the next dag run if the previous dag run has failures. I already set depends_on_past=True
, wait_for_downstream=True
, max_active_runs=1
.
What i have is tasks 1, 2, 3 where they:
task 3 always runs with trigger_rule=all_done
to make sure we always tear down resources. What i'm seeing is that if task 2 fails, and task 3 then succeeds, the next dag run starts and if i have wait_for_downstream=False
it runs task 1 since the previous task 1 was a success and if i have wait_for_downstream=true
then it doesn't start the dag as i expect which is perfect.
The problem is that if tasks 1 and 2 succeed but task 3 fails for some reason, now my next dag run starts and task 1 runs immediately because both task 1 and task 2 (due to wait_for_downstream
) were successful in the previous run. This is the worst case scenario because task 1 creates resources and then the job is never run so the resources just sit there allocated.
What i ultimately want is for any failure to stop the dag from proceeding to the next dag run. If my previous dag run is marked as fail then the next one should not start at all. Is there any mechanism for doing this?
My current 2 best effort ideas are:
Is there any out of the box solution for this? Seems fairly standard to not want to continue on failure and not want step 1 to start of run 2 if not all steps of run 1 were successful or if run 1 itself was marked as failed.
Upvotes: 2
Views: 8595
Reputation: 1
The ExternalTaskSensor may work, with an execution_delta of datetime.timedelta(days=1). From the docs:
execution_delta (datetime.timedelta) – time difference with the previous execution to look at, the default is the same execution_date as the current task or DAG. For yesterday, use [positive!] datetime.timedelta(days=1). Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
I've only used it to wait for upstream DAG's to finish, but seems like it should work as self-referencing because the dag_id and task_id are arguments for the sensor. But you'll want to test it first of course.
Upvotes: 0
Reputation: 2408
The reason depends_on_past
is not helping your is it's a task parameter not a dag parameter.
Essentially what you're asking for is for the dag to be disabled after a failure.
I can imagine valid use cases for this, and maybe we should add an AirflowDisableDagException
that would trigger this.
The problem with this is you risk having your dag disabled and not noticing for days or weeks.
A better solution would be to build recovery or abort logic into your pipeline so that you don't need to disable the dag.
One way you can do this is add a cleanup task to the start of your dag, which can check whether resources were left sitting there and tear them down if appropriate, and just fail the dag run immediately if you get an appropriate error. You can consider using airflow Variable
or Xcom
to store the state of your resources.
The other option, notwithstanding the risks, is the disable dag approach: if your process fails to tear down resources appropriately, disable the dag. Something along these lines should work:
class MyOp(BaseOperator):
def disable_dag(self):
orm_dag = DagModel(dag_id=self.dag_id)
orm_dag.set_is_paused(is_paused=True)
def execute(self, context):
try:
print('something')
except TeardownFailedError:
self.disable_dag()
Upvotes: 3