makansij

Reputation: 9875

Spark "ExecutorLostFailure" - how to solve?

I've checked out some of the other answers on "ExecutorLostFailure" and most of them either:

1. Don't have an answer
2. Insist on increasing the executor memory and the number of cores

Here are some of the ones that I'm referring to: here, here, and here.

Is there any other solution to this? I've tried both, but it's unclear to me how to correctly gauge how much to allocate for each (memory and cores) in my SparkContext.
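For context, this is roughly how I've been setting them so far (the app name and values below are just placeholders I've experimented with, not anything I'm confident in):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values -- sizing these correctly is exactly what I'm unsure about.
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.executor.memory", "4g") // how much is enough?
  .set("spark.executor.cores", "2")   // and how many cores?
val sc = new SparkContext(conf)
```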

The error occurs within a saveAsTextFile action. Thanks.

Upvotes: 0

Views: 2445

Answers (1)

timchap

Reputation: 513

In my experience, increasing the executor memory can help. But I'd suggest that this is a naive fix, and the underlying issue will usually remain.

The reason I say this is that one of Spark's most important features is that it allows you to perform computations on datasets that are too big to fit in memory. In theory, you could perform most calculations on a 1TB dataset with a single executor that has 2GB of memory.

In every case that I've encountered an OOM, it has been due to one of the following two reasons:

1. Insufficient executor memory overhead

This only applies if you are using a resource manager like Mesos or YARN. Check the Spark docs for guidance on this.
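As a sketch, on YARN you can raise the overhead via a config property (the property name below is the one used in older Spark versions; newer versions renamed it to spark.executor.memoryOverhead, and the value here is purely illustrative):

```scala
import org.apache.spark.SparkConf

// Illustrative value only (in MiB). Off-heap overhead for things like
// JVM internals and netty buffers comes out of this, not executor memory.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024")
```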

2. Something you are doing in your transformations is causing your RDD to become massively "horizontal".

Recall that I said Spark can handle datasets that are too big to fit in memory. The caveat is that the datasets must be vertically parallelizable - think of a text file with 10^8 rows, where each row contains a relatively small data point (e.g. a list of floats, a JSON string, a single sentence, etc.). Spark will then partition your dataset and send an appropriate number of rows to each executor.
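As a minimal sketch of what I mean (assuming `sc` is an existing SparkContext; the path and partition count are hypothetical):

```scala
// Each line is one small record; Spark splits the file into partitions
// and ships a manageable number of lines to each executor.
val lines = sc.textFile("hdfs:///data/points.txt", minPartitions = 200)
println(s"partitions: ${lines.getNumPartitions}")
```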

The problem arises when a single row is very large. This is unlikely to occur through normal map-like operations (unless you are doing something quite weird), but is very easy to do through aggregation-type operations like groupByKey or reduceByKey. Consider the following example:

Dataset (name, age):

John    30
Kelly   36
Steve   48
Jane    36

If I then do a groupByKey with the age as key, I will get data in the form:

36  [Kelly, Jane]
30  [John]
48  [Steve]

If the number of rows in the initial dataset is very large, the rows in the resulting dataset could be very long. If they are long enough, they may be too large to fit into executor memory.
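Here's that example as a runnable sketch (toy data; assumes `sc` is an existing SparkContext):

```scala
val people = sc.parallelize(Seq(
  ("John", 30), ("Kelly", 36), ("Steve", 48), ("Jane", 36)
))

// Key by age: groupByKey pulls every name for a given age into ONE row.
// With millions of rows per key, that one row can blow past executor memory.
val byAge = people.map { case (name, age) => (age, name) }.groupByKey()

byAge.collect().foreach { case (age, names) =>
  println(s"$age  ${names.mkString("[", ", ", "]")}")
}
```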

The solution?

It depends on your application. In some cases it may indeed be unavoidable, and you may just have to increase executor memory. But usually it's possible to restructure your algorithm to avoid the issue, e.g. by replacing a groupByKey with a countByKey, or by throwing away data points with a very high incidence rate. (In one case I observed, the issue was caused by a bot generating millions of requests; these could be safely discarded without affecting the analysis.)
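For instance, reusing the `people` RDD from the sketch above, counting per key avoids ever building the giant rows (note that countByKey returns a map to the driver, so it's only appropriate when the number of distinct keys is small):

```scala
// Distributed count per key: no giant value lists are ever materialized.
val countsRdd = people.map { case (_, age) => (age, 1L) }.reduceByKey(_ + _)

// Or, if the set of distinct keys fits comfortably on the driver:
val countsMap: scala.collection.Map[Int, Long] =
  people.map { case (name, age) => (age, name) }.countByKey()
```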

Upvotes: 2
