Ritvik Chauhan

Reputation: 1

How to restart a Databricks workflow from the point of failure using ADF?

I have created a pipeline in ADF that executes a workflow inside Databricks. When the pipeline fails, I need to restart it in such a way that the workflow resumes from the failed task only. Any ideas on how to achieve this?

I tried calling the POST API that executes the workflow again after the failure activity, but that isn't working.

Upvotes: 0

Views: 590

Answers (1)

Rakesh Govindula

Reputation: 11529

You can use the "Repair a job run" REST API to re-run the workflow job from the failed task.

https://<databricks instance>.azuredatabricks.net/api/2.1/jobs/runs/repair

If this is the first re-run after the workflow failed, there is no need to pass latest_repair_id in the body of the POST request. For every subsequent re-run, you need to pass the repair_id returned by the previous re-run's POST request as latest_repair_id.
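As a rough illustration of the two request shapes described above, here is a minimal Python sketch. The endpoint path and the `run_id`, `rerun_tasks`, `latest_repair_id`, and `repair_id` fields come from this answer; the helper names `build_repair_payload` and `repair_run`, the `host` and `token` parameters, and the use of a bearer token for authentication are assumptions made for the example, not part of the ADF setup.

```python
import json
import urllib.request


def build_repair_payload(run_id, failed_tasks, latest_repair_id=None):
    """Build the JSON body for POST /api/2.1/jobs/runs/repair.

    latest_repair_id is omitted on the first repair and included afterwards.
    """
    payload = {"run_id": run_id, "rerun_tasks": list(failed_tasks)}
    if latest_repair_id is not None:
        payload["latest_repair_id"] = latest_repair_id
    return payload


def repair_run(host, token, payload):
    """POST the payload and return the repair_id from the response.

    host is assumed to look like "https://<databricks instance>.azuredatabricks.net";
    token is assumed to be a personal access token sent as a bearer token.
    """
    request = urllib.request.Request(
        url=f"{host}/api/2.1/jobs/runs/repair",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["repair_id"]
```

The first call would use `build_repair_payload(run_id, ["task2"])`, and each later call would pass the `repair_id` returned by the previous one as `latest_repair_id`.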

Go through the demo below. This is my workflow job, which failed at task2.


For the first re-run, use a web activity with a body like the one below.

{"run_id":<Failed job run id>,"rerun_tasks":["<Failed task name1>","<Failed task name2>"]}


For the next re-run, pass the repair_id returned by the above web activity, @activity('Web rerun first time').output.repair_id, as latest_repair_id. Here, I stored it in an integer variable and passed it to the next web activity.

@json(concat('{"run_id":<Failed job run id>,"rerun_tasks":["<Failed task name1>","<Failed task name2>"],"latest_repair_id":',string(variables('latest_repair_id')),'}'))


When I debugged it, the first web activity succeeded but the second one failed, because the first repair had not finished yet.


So, if you are re-running the same job multiple times from ADF, make sure to add a delay between the web activities (for example, a Wait activity) that is longer than the execution time of the failed tasks, so that the next repair waits until the previous repair has completed.
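Outside ADF, the same "wait until the previous repair has completed" idea can be sketched as a polling loop instead of a fixed delay. This is only a sketch under assumptions: `wait_for_repair` and `get_state` are hypothetical names, `get_state` stands for whatever call you use to read the run's life-cycle state (for example via the Jobs "get run" endpoint), and the set of in-progress state names is an assumption to check against the Jobs API documentation.

```python
import time

# Assumed names of life-cycle states that mean "still in progress".
ACTIVE_STATES = {
    "PENDING", "QUEUED", "RUNNING", "TERMINATING",
    "WAITING_FOR_RETRY", "BLOCKED",
}


def wait_for_repair(get_state, poll_seconds=30, timeout_seconds=3600,
                    sleep=time.sleep):
    """Poll get_state() until the run leaves the in-progress states.

    get_state is a caller-supplied function returning the current
    life-cycle state as a string. Returns the final state, or raises
    TimeoutError once roughly timeout_seconds have been spent waiting.
    """
    waited = 0
    while True:
        state = get_state()
        if state not in ACTIVE_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

Only after this returns would the next repair request, with `latest_repair_id` set, be sent.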

The job shows 2 attempts (the original job run plus the first repair run), because the second web activity failed.


Upvotes: 0
