Reputation: 137
I was testing Spark in YARN cluster mode. The Spark job runs in a lower-priority queue, and its containers are preempted whenever a higher-priority job arrives. Spark relaunches the executors right after they are killed, and the higher-priority app preempts them again, so the apps are stuck in this loop indefinitely.
Infinite retry of executors is discussed here. I found the trace below in the logs:
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager:54 - Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
So it seems any retry count I set is not even considered. Is there a flag to indicate that all executor failures should be counted, so that the job fails once the maximum number of failures is reached?
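For reference, this is roughly how I submit the job (queue name and jar are placeholders); `spark.task.maxFailures` and `spark.yarn.max.executor.failures` are the limits I expected to cap the retries:

```shell
# Sketch of the submit command; values are illustrative.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --queue low_priority \
  --conf spark.task.maxFailures=4 \
  --conf spark.yarn.max.executor.failures=8 \
  my-job.jar
```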
Spark version: 2.11 (this is likely the Scala build version; we run Spark 2.0.2, as noted in the answer below)
Upvotes: 4
Views: 1379
Reputation: 137
Spark distinguishes between code throwing an exception and external issues, i.e. code failures vs. container failures. However, Spark does not count preemption as a container failure.
See ApplicationMaster.scala: this is where Spark decides to quit once the container-failure limit is hit. It gets the number of failed executors from YarnAllocator. YarnAllocator updates its failed-container count in some cases, but not for preemptions; see the `case ContainerExitStatus.PREEMPTED` branch in the same function.
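A minimal sketch (not the actual Spark source; names and values simplified) of that match logic: preempted containers fall through without incrementing the failure counter, which is why the limit is never reached under repeated preemption.

```scala
// Simplified model of YarnAllocator-style failure accounting.
object ExitStatus {
  val SUCCESS = 0
  val PREEMPTED = -102 // mirrors YARN's ContainerExitStatus.PREEMPTED
}

class FailureTracker {
  private var failedExecutors = 0

  def onContainerCompleted(exitStatus: Int): Unit = exitStatus match {
    case ExitStatus.SUCCESS =>
      // clean exit: not a failure
    case ExitStatus.PREEMPTED =>
      // preemption is treated as "not the app's fault" and is NOT counted,
      // so repeated preemptions never trip the executor-failure limit
    case _ =>
      failedExecutors += 1 // genuine failures increment the counter
  }

  def numFailedExecutors: Int = failedExecutors
}
```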
We use Spark 2.0.2, where the code is slightly different, but the logic is the same. The fix appears to be to update the failed-containers collection for preemptions as well.
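A hedged sketch of what such a fix might look like (simplified; this is not a real patch against Spark, and whether preemptions should count toward failures at all is debatable): counting preemptions makes the failure limit reachable, so the app can abort instead of looping.

```scala
// Hypothetical variant that counts preemptions toward the failure limit.
object YarnExit {
  val SUCCESS = 0
  val PREEMPTED = -102 // YARN's ContainerExitStatus.PREEMPTED value
}

class PatchedTracker(maxFailures: Int) {
  private var failed = 0

  def onContainerCompleted(exitStatus: Int): Unit = exitStatus match {
    case YarnExit.SUCCESS =>
      // clean exit: not a failure
    case YarnExit.PREEMPTED =>
      failed += 1 // the proposed change: count preemptions too
    case _ =>
      failed += 1
  }

  // With this change, an app stuck in a preemption loop eventually gives up.
  def shouldAbort: Boolean = failed >= maxFailures
}
```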
Upvotes: 1