Reputation: 9521
I have the following stages in the Spark web UI (running on YARN):

What I'm surprised by is Stage 0's retry 1, retry 2. What can cause such a thing?

I tried to reproduce it myself by killing all executor processes (CoarseGrainedExecutorBackend) on one of my cluster machines, but all I got were some failed tasks with the description Resubmitted (resubmitted due to lost executor).
What is the reason for a whole-stage retry? I'm also curious why the number of records read differed between the stage attempts:

Notice the 3011506 records read in Attempt 1 versus 195907736 in Attempt 0. Does a stage retry cause Spark to re-read some records twice?
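For reference, the kind of job I was running is roughly the following (a minimal sketch; the object name and input path are illustrative, not my actual job):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleRetryRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ShuffleRetryRepro").getOrCreate()
    val sc = spark.sparkContext

    // reduceByKey forces a shuffle: a map stage writes shuffle files to
    // executor-local disk, then a reduce stage fetches them over the network.
    val counts = sc.textFile("hdfs:///tmp/input") // illustrative path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    println(counts.count()) // action that triggers both stages

    spark.stop()
  }
}
```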
Upvotes: 12
Views: 4455
Reputation: 11
Fetch failure: a reduce task is not able to perform its shuffle read, i.e. it cannot locate the shuffle file on disk that was written by a shuffle map task.
Upvotes: 1
Reputation: 916
A stage failure might be due to a FetchFailure in Spark.

Fetch failure: a reduce task is not able to perform its shuffle read, i.e. it cannot locate the shuffle file on disk that was written by a shuffle map task.

Spark will retry the stage if stageFailureCount < maxStageFailures; otherwise it aborts the stage and the corresponding job.
https://youtu.be/rpKjcMoega0?t=1309
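A rough sketch of that retry decision (the Stage class and the handler below are illustrative names, not the actual DAGScheduler API; in recent Spark versions the limit corresponds to the spark.stage.maxConsecutiveAttempts setting, default 4):

```scala
// Self-contained sketch of the stage-retry decision described above.
object StageRetrySketch {
  final case class Stage(id: Int, var failureCount: Int = 0)

  val maxStageFailures = 4 // cf. spark.stage.maxConsecutiveAttempts

  def onFetchFailure(stage: Stage): Unit = {
    stage.failureCount += 1
    if (stage.failureCount < maxStageFailures) {
      // Shuffle output was lost: re-run the parent map stage to regenerate
      // the missing shuffle files, then retry this stage.
      println(s"Resubmitting stage ${stage.id} (attempt ${stage.failureCount})")
    } else {
      // Too many consecutive failures: give up on the stage and its job.
      println(s"Aborting stage ${stage.id} and its job")
    }
  }

  def main(args: Array[String]): Unit = {
    val s = Stage(0)
    (1 to 5).foreach(_ => onFetchFailure(s))
  }
}
```

Note that with the external shuffle service enabled (spark.shuffle.service.enabled=true), shuffle files can outlive a lost executor, which makes such fetch failures less likely in the first place.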
Upvotes: 6