Adam Bacon

Reputation: 55

Which part of Spark reruns a failed task on a different node?

When a task running on a node fails, Spark will automatically retry the task on a different node. My question is: which part of Spark is responsible for rescheduling the failed task?

Upvotes: 1

Views: 280

Answers (1)

Harjeet Kumar

Reputation: 524

When a Spark task fails, the following things happen:

  1. The NodeManager on that machine tries to rerun the task on the same machine and also informs the ApplicationMaster.
  2. Based on speculative execution, the ApplicationMaster may decide to run a duplicate of the task on another machine; the ResourceManager does not restart tasks. (The relevant settings are shown in the configuration sketch after this list.)
  3. The task is restarted from the beginning, since all partitions processed by that task were lost in the earlier failure. That is where Spark's RDD lineage comes into the picture: Spark walks the lineage and recreates the partitions that were lost as part of the task failure. (See the lineage sketch below.)
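
For context on points 1 and 2, retry and speculation behaviour is driven by standard Spark settings. Here is a minimal Scala sketch, assuming a local master and a placeholder app name: `spark.task.maxFailures` caps how many times a single task may fail before its stage is aborted (default 4), and `spark.speculation` enables launching duplicates of slow tasks on other executors:

```scala
import org.apache.spark.sql.SparkSession

object RetryConfigSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders; the config keys are
    // standard Spark settings.
    val spark = SparkSession.builder()
      .appName("retry-config-sketch")
      .master("local[2]")
      // How many times a single task may fail before the whole stage
      // (and eventually the job) is aborted. Default is 4.
      .config("spark.task.maxFailures", "4")
      // Enable speculative execution: a duplicate of a slow-running task
      // may be launched on another executor.
      .config("spark.speculation", "true")
      // A task is considered slow if it runs longer than this multiple
      // of the median task duration in its stage.
      .config("spark.speculation.multiplier", "1.5")
      .getOrCreate()

    // ... run jobs here ...

    spark.stop()
  }
}
```

Note that speculation targets slow tasks; a task that fails outright is simply retried, up to `spark.task.maxFailures` attempts.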
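
To make point 3 concrete, the lineage that Spark replays during recovery can be inspected with `RDD.toDebugString`. A minimal sketch, using a throwaway word-count pipeline as the example:

```scala
import org.apache.spark.sql.SparkSession

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[2]")
      .getOrCreate()

    // Build a small RDD pipeline; each transformation is recorded
    // in the RDD's lineage graph.
    val counts = spark.sparkContext
      .parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // toDebugString prints the recorded lineage. When a task fails and
    // its partitions are lost, Spark walks this graph backwards to
    // recompute only the lost partitions.
    println(counts.toDebugString)

    spark.stop()
  }
}
```

Each line of the printed lineage corresponds to a transformation Spark can re-execute to rebuild exactly the partitions that were lost.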

Upvotes: 2
