Reputation: 1813
I have run a Python script like this:
spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 2G \
--driver-cores 2 \
--executor-memory 8G \
--num-executors 3 \
--executor-cores 3 \
script.py
And I get logs like this:
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
[Stage 1:=================================================> (13 + 2) / 15]18/04/13 13:49:18 ERROR YarnScheduler: Lost executor 3 on serverw19.domain: Container killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
[Stage 1:=====================================================> (14 + 1) / 15]18/04/13 14:01:43 ERROR YarnScheduler: Lost executor 1 on serverw51.domain: Container killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
[Stage 1:====================================================> (14 + -1) / 15]18/04/13 14:02:48 ERROR YarnScheduler: Lost executor 2 on serverw15.domain: Container killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
[Stage 1:====================================================> (14 + -8) / 15]18/04/13 14:02:49 ERROR YarnScheduler: Lost an executor 2 (already removed): Pending loss reason.
[Stage 1:=======================================================(26 + -11) / 15]18/04/13 14:29:53 ERROR YarnScheduler: Lost executor 5 on serverw38.domain: Container killed by YARN for exceeding memory limits. 12.0 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
[Stage 1:=======================================================(28 + -13) / 15]^[18/04/13 14:43:35 ERROR YarnScheduler: Lost executor 6 on serverw10.domain: Slave lost
18/04/13 14:43:35 ERROR TransportChannelHandler: Connection to serverw22.domain/10.252.139.122:54308 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
[Stage 1:=======================================================(28 + -15) / 15]18/04/13 14:44:22 ERROR TransportClient: Failed to send RPC 9128980605450004417 to serverw22.domain/10.252.139.122:54308: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
18/04/13 14:44:22 ERROR YarnScheduler: Lost executor 4 on serverw36.domain: Slave lost
[Stage 1:=======================================================(31 + -25) / 15]18/04/13 15:05:11 ERROR TransportClient: Failed to send RPC 7766740408770504900 to serverw22.domain/10.252.139.122:54308: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
18/04/13 15:05:11 ERROR YarnScheduler: Lost executor 7 on serverw38.domain: Slave lost
[Stage 1:=======================================================(31 + -25) / 15]
Regards
Pawel
Upvotes: 1
Views: 243
Reputation: 1725
What do the values in brackets mean? (13 + 2) / 15, later (28 + -13) / 15, etc., and finally (31 + -25) / 15?
The first number is the number of partitions that have finished for the current operation.
The second number is the number of partitions that are currently being processed. If the number is negative, that means that the partitions' results are invalid and must be recomputed.
Finally, the last number is the total number of partitions that the current operation has.
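As a concrete reading of the lines above: (13 + 2) / 15 means 13 partitions finished and 2 currently being processed, out of 15 in the stage. In the later lines such as (28 + -13) / 15, the negative second number reflects partitions whose results were invalidated by the lost executors and will have to be recomputed.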
Why are executors lost?
As the log error messages say, the tasks are using more memory than what was physically allocated to the executors.
Is this application dead and should I kill it, or will it finish successfully?
Usually Spark should be able to finish the application one way or another (whether it ends successfully or with an error). However, in this case I would not have much hope of it finishing successfully, so if I were you I would just kill it and review the memory settings.
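If you resubmit, one option (which the log itself suggests) is to give each executor container more headroom via the overhead setting. The 4096 MB below is only an illustrative value, not a recommendation tuned to your job:
# 4096 MB overhead is just an example value; tune it to your job
spark-submit \
--master yarn \
--deploy-mode client \
--driver-memory 2G \
--driver-cores 2 \
--executor-memory 8G \
--num-executors 3 \
--executor-cores 3 \
--conf spark.yarn.executor.memoryOverhead=4096 \
script.py
Alternatively, raising --executor-memory or lowering --executor-cores (so fewer tasks share one executor's memory) can have a similar effect.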
Upvotes: 1