Reputation: 279
We are running Spark jobs on Kubernetes (EKS, not EMR) using the Spark operator. After some time, some executors receive SIGTERM. Here is an example log from an executor:
Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.
On the driver side, after 2 minutes without heartbeats, the driver decides to kill the executor:
Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
I tried to check whether we are crossing some resource limit at the Kubernetes level but couldn't find anything like that. What can I look for to understand why Kubernetes is killing the executors?
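For context, the 120000 ms in the driver log appears to come from Spark's heartbeat timeout (spark.network.timeout, 120s by default, checked against spark.executor.heartbeatInterval). A minimal sketch of where those knobs live; the values shown are only illustrative, not what we actually run:

    import org.apache.spark.sql.SparkSession

    // Illustrative only: the settings behind the "no recent heartbeats" check.
    // spark.executor.heartbeatInterval should stay well below spark.network.timeout.
    val spark = SparkSession.builder()
      .appName("heartbeat-timeout-sketch")
      .config("spark.executor.heartbeatInterval", "10s") // default 10s
      .config("spark.network.timeout", "120s")           // default 120s -> the 120000 ms in the log
      .getOrCreate()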
FOLLOW UP:
I missed a log message on the driver side:
Mar 01 21:04:23.471 Disabling executor 50.
and then on the executor side:
Mar 01 21:04:23.348 RECEIVED SIGNAL TERM
I looked at which class writes the "Disabling executor" log message and found KubernetesDriverEndpoint. It seems that the onDisconnected method is called for all these executors, and this method calls disableExecutor in DriverEndpoint. So the question now is: why are these executors considered disconnected?
Looking at the explanation on this site:
https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback
it says:
Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
But I couldn't find any WARN logs on the driver side. Any suggestions?
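One thing I could try is raising the driver log level so the RPC disassociation path logs more detail before the executor gets disabled. A minimal sketch (setLogLevel is the standard SparkContext API; DEBUG is very verbose, so scoping it through a log4j configuration may be preferable in practice):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    // Bump driver-side verbosity so the disconnect/onDisconnected path
    // logs more than the single "Disabling executor" line.
    spark.sparkContext.setLogLevel("DEBUG")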
Upvotes: 0
Views: 1733
Reputation: 279
The reason the executors were being killed was that we were running them on AWS spot instances. This is what it looks like: the first sign we saw of an executor being killed was this line in its log:
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Once we moved the executors to on-demand instances as well, not a single executor was terminated across 20-hour jobs.
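If moving everything to on-demand is not an option, a couple of Spark-side settings may soften the problem. This is a sketch only: the eks.amazonaws.com/capacityType label is what EKS managed node groups set and is an assumption here (Karpenter, for example, uses karpenter.sh/capacity-type), and decommissioning requires Spark 3.1+:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spot-mitigation-sketch")
      // Pin the application's pods to on-demand capacity via a node selector
      // (this applies to driver and executors; verify the label key your cluster uses).
      .config("spark.kubernetes.node.selector.eks.amazonaws.com/capacityType", "ON_DEMAND")
      // Or, if some spot usage is acceptable, let Spark migrate blocks off an
      // executor whose node is being reclaimed (Spark 3.1+).
      .config("spark.decommission.enabled", "true")
      .config("spark.storage.decommission.enabled", "true")
      .getOrCreate()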
Upvotes: 2