Nathan Case

Reputation: 695

Spark ExecutorLostFailure Memory Exceeded

I have been trying to get a Spark job to run to completion for several days now, and I was finally able to get it to complete, but there was still a large number of failed tasks where executors were being killed with the following message:

ExecutorLostFailure (executor 77 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 45.1 GB of 44.9 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead

These are the properties I am passing to the cluster:

[
    {
        "classification": "spark-defaults",
        "properties": {
            "spark.executor.memory": "41000m",
            "spark.driver.memory": "8000m",
            "spark.executor.cores": "6",
            "spark.shuffle.service.enabled": "true",
            "spark.executor.instances": "98",
            "spark.yarn.executor.memoryOverhead": "5000"
        }
    }
]

The cluster comprises 20 machines, each with 32 cores and 240G of memory. Should I just continue to raise the memoryOverhead, or is there a point where it indicates a deeper problem? The error this time seemed to occur during a coalesce from 5000 partitions down to 500 before writing the resulting data to S3. I am guessing the coalesce caused a shuffle, and since the cluster was already low on memory it pushed it too far.

The workflow is as follows:

  1. Load Parquet files from S3 into a DataFrame
  2. Extract the set of unique keys that group the data, using a SQL query against the DataFrame
  3. Transform the DataFrame to a JavaRDD and apply several map functions
  4. mapToPair the data
  5. combineByKey using the call below, which essentially merges individual objects into arrays of objects by key

    combineByKey(createCombiner, mergeValue, mergeCombiners, new HashPartitioner(5000), false, null);

  6. More map operations

  7. For each of the unique keys, filter the RDD to get just the tuples with that key, then persist each of those subsets to disk after coalescing

Another question is how the 44.9 number from above is derived. I figured the max memory would be executor memory + memoryOverhead, which would be 46G, not 44.9G.

Any help would be greatly appreciated, Nathan

Upvotes: 2

Views: 4291

Answers (1)

Glennie Helles Sindholt

Reputation: 13154

From my experience this indicates a deeper problem, and from what you have posted I see a couple of pitfalls.

First of all, you may want to have a look at the partition sizes, as the OOM could easily be caused by data skew created during the combineByKey operation. Perhaps some keys are very frequent?
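
For instance, a quick way to check for skew is to count records per key and per partition. This is just a generic sketch: sc and pairRdd stand in for your SparkContext and for whatever pair RDD feeds the combineByKey.

// Stand-in for the pair RDD that goes into combineByKey (placeholder data).
val pairRdd = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))

// Records per key: heavily skewed keys float to the top of this list.
pairRdd
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)
  .sortBy(-_._2)
  .take(20)
  .foreach(println)

// Records per partition: very uneven counts also point to skew.
pairRdd.glom().map(_.length).collect().foreach(println)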

If not, I would look at the coalesce call. You haven't posted the code, so I can only guess at the DAG being generated, but I would pay attention to the coalesce and to the other operations executed in the same write stage.

Spark executes in stages, and from what I can tell from your explanation, you call coalesce just before write. Depending on how many partitions you have going into this final stage, and depending on the transformations done in it, you may actually be operating on fewer partitions than required, thus resulting in the OOM exception.
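
If you want to see which transformations Spark has folded into that final write stage, printing the lineage is a cheap first check. This is only a sketch; finalRdd stands in for whatever you coalesce just before writing, and df for the corresponding DataFrame:

// Indentation changes in the lineage mark shuffle boundaries, so everything
// printed at the same indentation level runs together in one stage.
println(finalRdd.toDebugString)

// On the DataFrame side, the physical plan gives the same kind of information.
df.explain(true)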

It's a little complicated to explain in words, but I will try to give a simple example of what could be going on.

Imagine the simple scenario where you read in a file of key-value pairs of, say, (Int, Double), and then apply some function to all the values, like round. You then wish to write the output back to a single file, so you call coalesce(1) followed by write. The code would look something like this:

import org.apache.spark.sql.Row
import sqlContext.implicits._

val df = sqlContext.read.parquet("/path/to/my/file/")
df.map { case Row(key: Int, value: Double) => (key, math.round(value)) }
  .toDF()
  .coalesce(1)
  .write
  .parquet("/my/output/path/")

Now one might think that the map operation is executed in parallel across your entire cluster, but if you pay attention to the Spark UI, you will notice that this task is not distributed. Because of the coalesce(1), Spark knows that everything needs to end up in a single partition, so it simply starts gathering all the data into one partition, applying the map function as it goes along. As you can probably imagine, this can easily end up in OOM exceptions with a more complicated transformation.
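
If you do need the single output file, one common workaround (just a sketch, continuing the example above) is to force a shuffle with repartition instead, so the map still runs in parallel and only the final write happens on one partition:

df.map { case Row(key: Int, value: Double) => (key, math.round(value)) }
  .toDF()
  .repartition(1)  // the shuffle boundary keeps the map stage distributed across the cluster
  .write
  .parquet("/my/output/path/")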

I hope this gives you a couple of pointers as to where to look. Good luck :)

Upvotes: 6
