Some Name

Reputation: 9521

What can cause a stage to be reattempted in Spark?

I have the following stages in the Spark web UI (running on YARN):

[screenshot: the Stages page showing Stage 0 with retry 1 and retry 2]

What surprises me is Stage 0 retry 1 and retry 2. What can cause such a thing?

I tried to reproduce it myself by killing all executor processes (CoarseGrainedExecutorBackend) on one of my cluster machines, but all I got was some failed tasks with the description Resubmitted (resubmitted due to lost executor).

What causes a whole stage to be retried? I'm also curious why the number of Records read differed between the stage attempts:

[screenshots: the Records read metric for Attempt 0 and Attempt 1 of the stage]

Notice the 3011506 in Attempt 1 versus 195907736 in Attempt 0. Does a stage retry cause Spark to read some records twice?
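For context, here is a minimal sketch of the kind of two-stage shuffle job where I see this (the numbers and partition counts are purely illustrative, not my real job):

```scala
// Run in spark-shell (sc is the SparkContext the shell provides).
// Stage 0 writes shuffle files; stage 1 reads them across the shuffle boundary.
// Killing an executor while stage 1 is fetching (rather than while stage 0 is
// still running) is the situation where lost shuffle files come into play.
val counts = sc.parallelize(1 to 10000000, 100)  // stage 0: shuffle map side
  .map(i => (i % 1000, 1))
  .reduceByKey(_ + _, 100)                       // shuffle boundary
counts.count()                                    // triggers both stages
```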

Upvotes: 12

Views: 4455

Answers (2)

Fei YuanXing

Reputation: 11

Fetch failure: a reduce task is unable to perform the shuffle read, i.e. it cannot locate the shuffle file on disk that was written by the shuffle map task.
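Not from this answer itself, but as an illustration: the shuffle client also has its own low-level fetch retries before it gives up and reports a FetchFailed. A minimal sketch (the values are just examples to experiment with, not recommendations from this post):

```scala
// Sketch only: raise the shuffle client's IO retries so transient executor or
// network hiccups are retried at the fetch level instead of surfacing as FetchFailed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-fetch-tuning-sketch")
  .config("spark.shuffle.io.maxRetries", "10")   // default is 3
  .config("spark.shuffle.io.retryWait", "10s")   // default is 5s
  .getOrCreate()
```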

Upvotes: 1

Shiva Garg

Reputation: 916

A stage failure might be due to a FetchFailure in Spark.

Fetch failure: a reduce task is unable to perform the shuffle read, i.e. it cannot locate the shuffle file on disk that was written by the shuffle map task.

Spark will retry the stage if stageFailureCount < maxStageFailures; otherwise it aborts the stage and the corresponding job.
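A minimal sketch of that decision, modeled loosely on the behaviour described above rather than on Spark's actual DAGScheduler code (the threshold corresponds, I believe, to the spark.stage.maxConsecutiveAttempts setting, default 4):

```scala
// Sketch of the retry/abort decision for a stage that hit a FetchFailed.
// The names here (Stage, onFetchFailed) are illustrative, not Spark's real API.
case class Stage(id: Int, var failedAttempts: Int = 0)

def onFetchFailed(stage: Stage, maxStageFailures: Int = 4): Unit = {
  stage.failedAttempts += 1
  if (stage.failedAttempts < maxStageFailures) {
    // Re-run the map stage that produced the missing shuffle output,
    // then resubmit this stage; it shows up in the UI as "retry N".
    println(s"Resubmitting stage ${stage.id} as attempt ${stage.failedAttempts}")
  } else {
    // Too many consecutive failures: abort the stage and fail its job.
    println(s"Aborting stage ${stage.id} and the corresponding job")
  }
}
```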

https://youtu.be/rpKjcMoega0?t=1309

Upvotes: 6
