user10439725

Reputation: 137

Spark keeps relaunching executors after YARN kills them

I was testing Spark in YARN cluster mode. The Spark job runs in a lower-priority queue, and its containers are preempted when a higher-priority job arrives. However, Spark relaunches the containers right after they are killed, and the higher-priority app kills them again, so the apps end up stuck in this preempt-and-relaunch loop.

Infinite retry of executors is discussed here. I found the trace below in the logs.

    2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager:54 - Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.

So it seems any retry count I set is not even considered. Is there a flag to indicate that all executor failures should be counted, so that the job fails when maxFailures is reached?
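For context, the retry settings I have been adjusting look roughly like this (a minimal sketch; spark.task.maxFailures and spark.yarn.max.executor.failures are the standard properties, the values here are just examples):

    // Illustrative only: retry-related settings I have been adjusting.
    // spark.task.maxFailures           -> per-task retry limit
    // spark.yarn.max.executor.failures -> executor-failure limit checked by the YARN ApplicationMaster
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.task.maxFailures", "4")
      .set("spark.yarn.max.executor.failures", "10")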

Spark version 2.11

Upvotes: 4

Views: 1379

Answers (1)

user10439725

Reputation: 137

Spark distinguishes between the code throwing an exception and external issues, i.e. code failures versus container failures. However, Spark does not count preemption as a container failure.

See ApplicationMaster.scala, where Spark decides to quit if the container failure limit is hit. It gets the number of failed executors from YarnAllocator. YarnAllocator updates its failed-container count in some cases, but not for preemptions; see the case ContainerExitStatus.PREEMPTED branch in the same function.
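To make that concrete, here is a simplified sketch of the exit-status handling (paraphrased, not the actual Spark source; PreemptionSketch, handleCompletedContainer and numExecutorsFailed are just stand-in names for illustration):

    import org.apache.hadoop.yarn.api.records.{ContainerExitStatus, ContainerStatus}

    // Simplified sketch (not the exact Spark code) of how YarnAllocator
    // reacts to a completed container reported by YARN.
    object PreemptionSketch {
      var numExecutorsFailed = 0  // stand-in for YarnAllocator's failure tracking

      def handleCompletedContainer(status: ContainerStatus): Unit =
        status.getExitStatus match {
          case ContainerExitStatus.SUCCESS =>
            // clean exit: nothing to count
          case ContainerExitStatus.PREEMPTED =>
            // preempted: only logged, NOT counted as a failure, so the
            // ApplicationMaster never reaches its failure limit and keeps
            // asking YARN for replacement containers
            println(s"Container ${status.getContainerId} was preempted.")
          case _ =>
            // any other abnormal exit counts as an executor failure
            numExecutorsFailed += 1
        }
    }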

We use Spark 2.0.2, where the code is slightly different but the logic is the same. The fix seems to be to also update the failed-containers collection for preemptions.
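In terms of the sketch above, the change would be roughly the following (again just an illustration, not an actual patch):

    // Sketch of the proposed change: count preempted containers as failures
    // too, so the ApplicationMaster's failure limit eventually trips.
    case ContainerExitStatus.PREEMPTED =>
      println(s"Container ${status.getContainerId} was preempted.")
      numExecutorsFailed += 1   // previously not incremented for preemptions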

Upvotes: 1
