Rami

Reputation: 8314

SPARK standalone cluster: Executors exit, how to track the source of the error?

I am running a standalone cluster on a single machine with 250GB of memory, 40 cores, and several TB of hard-disk space.

I am initialising a cluster of 8 executors, each with 5 cores and 28GB of memory.
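For reference, a minimal sketch of how such a layout is typically requested on a standalone master; the application name, master URL, and the use of spark.cores.max are assumptions on my part:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("feature-extraction")           // assumed application name
  .master("spark://master-host:7077")      // assumed standalone master URL
  .config("spark.executor.cores", "5")     // 5 cores per executor
  .config("spark.executor.memory", "28g")  // 28GB heap per executor
  .config("spark.cores.max", "40")         // 40 total cores / 5 => 8 executors
  .getOrCreate()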

I am reading data and my persistence strategy is MEMORY_AND_DISK.

I read Parquet files, process them, and generate a DataFrame, which I then pass to a pipeline to extract features and train a Random Forest classifier.
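A hedged sketch of what that flow could look like; it reuses the spark session from the sketch above, and the input path and the column names f1, f2 and label are placeholders:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.storage.StorageLevel

// Read the Parquet input and persist it so partitions that do not fit
// in memory spill to disk instead of being recomputed.
val df = spark.read.parquet("/data/input")          // placeholder path
df.persist(StorageLevel.MEMORY_AND_DISK)

// Assemble feature columns and train a Random Forest classifier.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))                  // placeholder feature columns
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = new Pipeline().setStages(Array(assembler, rf)).fit(df)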

While generating the DataFrame I am losing executors, but I am not able to spot the reason.

I see errors like the following:

16/12/15 11:07:30 ERROR TaskSchedulerImpl: Lost executor 3 on XXXX: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/12/15 11:07:30 WARN TaskSetManager: Lost task 172.0 in stage 171.0 (TID 7757, XXXX): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

I have looked at the executor's stderr log in the Spark UI, but I couldn't spot anything (the INFO logging level is enabled); there are only INFO messages, without any WARN or ERROR.

I monitor the available memory on the executors (again using the Spark UI), and right before an executor exits there is still memory available, as well as plenty of disk space.

Upvotes: 3

Views: 8234

Answers (1)

oh54

Reputation: 498

If you have 8 executors with 28g of memory specified for each, you have just 26g left for everything else (8 × 28g = 224g out of 250g). The different overheads add up quickly, and it's entirely possible that this is too little, so the executors get killed for hogging memory.

Try using something like 20g per executor, or just generally play around with the values. Are you still losing executors?
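For example, keeping the layout from the question but lowering the heap (a sketch, with the master URL again assumed; 8 × 20g = 160g, leaving roughly 90g of the 250g box for the driver, OS, and off-heap overheads):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://master-host:7077")      // assumed standalone master URL
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "20g")  // reduced from 28g
  .config("spark.cores.max", "40")
  .getOrCreate()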

Upvotes: 5
