Reputation: 9875
I've checked out some of the other answers on "ExecutorLostFailure", and most of them either:

1. **Don't have an answer**
2. **Insist on increasing the executor memory and the number of cores**
Here are some of the ones that I'm referring to: here, here, here
Is there any other solution to this? I've tried both, but it's unclear to me how to correctly gauge how much to allocate for each (memory and cores) in my `SparkContext`. The error occurs within a `saveAsTextFile` action.
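For context, this is roughly how I'm creating the context at the moment (the values are guesses on my part, not tuned numbers):

```python
from pyspark import SparkConf, SparkContext

# The memory/cores values below are placeholder guesses.
conf = (SparkConf()
        .setAppName("my_job")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2"))
sc = SparkContext(conf=conf)

rdd = sc.textFile("input.txt")
# ... transformations ...
rdd.saveAsTextFile("output")  # the ExecutorLostFailure surfaces here
```

Thanks.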
Upvotes: 0
Views: 2445
Reputation: 513
From my experience, increasing the executor memory can help. But I'd suggest that this is a naive fix, and usually the underlying issue will remain.
The reason I say this is that one of Spark's most important features is that it allows you to perform computations on datasets that are too big to fit in memory. In theory, you could perform most calculations on a 1TB dataset using a single executor with 2GB of memory.
In every case where I've encountered an OOM, it has been for one of the following two reasons:

**1. Misconfigured resource manager memory limits.** (This only applies if you are using a resource manager like Mesos or YARN.) Check the Spark docs for guidance on this.
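A minimal sketch of what I mean, assuming YARN and an older Spark version where the overhead key is `spark.yarn.executor.memoryOverhead` (newer versions use `spark.executor.memoryOverhead`):

```python
from pyspark import SparkConf, SparkContext

# YARN kills executors that exceed their container's memory limit,
# which counts off-heap usage as well as the JVM heap. Raising the
# overhead allowance is a common fix for that flavour of failure.
conf = (SparkConf()
        .set("spark.executor.memory", "4g")
        .set("spark.yarn.executor.memoryOverhead", "1024"))  # MB
sc = SparkContext(conf=conf)
```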
**2. A single row that is too large to fit in memory.** Recall that I said Spark can handle datasets that are too big to fit in memory. The caveat to this is that the datasets must be vertically parallelizable - think a text file with 10^8 rows, where each row contains a relatively small data point (e.g. a list of floats, a JSON string, a single sentence, etc.). Spark will then partition your dataset and send an appropriate number of rows to each executor.
The problem arises when a single row is very large. This is unlikely to occur through normal `map`-like operations (unless you are doing something quite weird), but is very easy to do through aggregation-type operations like `groupByKey` or `reduceByKey`. Consider the following example:
Dataset (name, age):
John 30
Kelly 36
Steve 48
Jane 36
If I then do a `groupByKey` with the age as key, I will get data in the form:
36 [Kelly, Jane]
30 [John]
48 [Steve]
If the number of rows in the initial dataset is very large, the rows in the resulting dataset could be very long. If they are long enough, they may be too large to fit into executor memory.
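Here's the example as a runnable PySpark sketch (the dataset is tiny, but the comment marks where the rows blow up at scale):

```python
from pyspark import SparkContext

sc = SparkContext(appName="groupByKey-demo")

# The (name, age) dataset from the example above.
people = sc.parallelize([("John", 30), ("Kelly", 36),
                         ("Steve", 48), ("Jane", 36)])

# Key by age: each output row holds *every* name sharing that age.
# With billions of input rows, one popular key can yield a single
# row too large for an executor's memory.
by_age = people.map(lambda name_age: (name_age[1], name_age[0])).groupByKey()

for age, names in by_age.collect():
    print(age, list(names))   # e.g. 36 ['Kelly', 'Jane']
```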
Whether it's avoidable depends on your application. In some cases it may indeed be unavoidable, and you may just have to increase executor memory. But usually it's possible to restructure your algorithm to avoid the issue, e.g. by replacing a `groupByKey` with a `countByKey`, or by throwing away data points with a very high incidence rate. (In one case I observed, a bot generating millions of requests was responsible for the issue; those requests could be safely discarded without affecting the analysis.)
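Continuing the sketch above, if the analysis only needs a count per age rather than the names themselves, the restructuring could look like this:

```python
# Same (name, age) data as in the sketch above, keyed by age.
pairs = sc.parallelize([("John", 30), ("Kelly", 36),
                        ("Steve", 48), ("Jane", 36)]) \
          .map(lambda name_age: (name_age[1], name_age[0]))

# countByKey never builds the per-key lists; it returns a
# {key: count} dict to the driver.
age_counts = pairs.countByKey()   # {36: 2, 30: 1, 48: 1}

# If even the counts are too large for the driver, reduceByKey keeps
# the result distributed and combines partial sums on each executor
# before the shuffle, so the giant rows are never materialized.
age_counts_rdd = pairs.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)
```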
Upvotes: 2