irbull

Reputation: 2530

Are failed spark executors a cause for concern?

I understand that Apache Spark is designed around resilient data structures, but are failures expected during a running system or does this typically indicate a problem?

As I begin to scale the system out to different configurations, I see ExecutorLostFailure and "No more replicas" warnings (see below). The system recovers and the program finishes.

Should I be concerned about this, and are there things we can typically do to avoid it; or is this expected as the number of executors grows?

18/05/18 23:59:00 WARN TaskSetManager: Lost task 87.0 in stage 4044.0 (TID 391338, ip-10-0-0-68.eu-west-1.compute.internal, executor 11): ExecutorLostFailure (executor 11 exited caused by one of the running tasks) Reason: Container marked as failed: container_1526667532988_0010_01_000012 on host: ip-10-0-0-68.eu-west-1.compute.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_193_7 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_582_50 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_401_91 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_582_186 !
18/05/18 23:59:00 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_115_139 !

Upvotes: 18

Views: 21047

Answers (2)

Thiago Mata

Reputation: 2959

More important than how many failures you are receiving is the cause of those failures.

If the failures are caused by network problems, that is OK; it is expected in distributed systems. When you have many machines talking to each other, at some point you will have communication issues.

But if the cause of the error is related to resource consumption, then you may have a dangerous problem. In general, all slaves have similar specs. If some job requires more resources than are available on one slave, the same thing will probably happen again and again on the next slaves. They will keep failing until all of them become unresponsive, in a domino effect.

In this last case, you may need to rethink and rewrite your code to reduce the amount of memory or disk each slave needs for each step of the job. Some common improvements are applying all filters before grouping, or changing the grouping-by-key strategy, as in the sketch below.
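As an illustration only, here is a minimal Scala sketch of the "filter before grouping" idea, using a made-up (key, value) RDD: filtering first reduces the data that has to be shuffled, and reduceByKey combines values map-side instead of collecting whole groups on one executor.

```scala
import org.apache.spark.sql.SparkSession

object FilterBeforeGrouping {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("filter-before-grouping").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (key, value) data standing in for a real dataset.
    val records = sc.parallelize(Seq(("a", 1), ("b", 7), ("a", 3), ("c", 2), ("b", 4)))

    // Heavier variant: group first, then filter inside each group.
    // All values for a key are shuffled and held together before filtering.
    val heavy = records
      .groupByKey()
      .mapValues(vs => vs.filter(_ > 2).sum)

    // Lighter variant: filter before grouping, and use reduceByKey so values
    // are combined map-side before the shuffle.
    val light = records
      .filter { case (_, v) => v > 2 }
      .reduceByKey(_ + _)

    light.collect().foreach(println)
    spark.stop()
  }
}
```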

Upvotes: 3

Dimos

Reputation: 8928

As I begin to scale the system out to different configurations, I see ExecutorLostFailure and No more replicas (See below). Should I be concerned with this?

You are right, this exception does not necessarily mean that something is wrong with your Spark job, because it will be thrown even in cases where a server stopped working for physical reasons (e.g. an outage).

However, if you see multiple executor failures in your job, this is probably a signal that something can be improved. More specifically, the Spark configuration contains a parameter called spark.task.maxFailures, which is the maximum number of failures allowed for each task before the whole job is considered failed. As a result, in a well-behaved Spark job you might see some executor failures, but they should be rare, and you should rarely see a specific task failing multiple times, because that suggests the problem is not the executor's fault but rather that the task itself is extremely heavy to deal with.
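For reference, here is a minimal sketch of how that setting can be supplied when building the session; the value 8 and the app name are purely illustrative, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object MaxFailuresExample {
  def main(args: Array[String]): Unit = {
    // spark.task.maxFailures (default 4) is how many times a single task may
    // fail before the job is aborted. The value 8 here is only an example.
    val conf = new SparkConf()
      .setAppName("task-max-failures-example")
      .set("spark.task.maxFailures", "8")

    val spark = SparkSession.builder.config(conf).getOrCreate()
    // ... run the job as usual ...
    spark.stop()
  }
}
```

The same setting can also be passed on the command line with spark-submit --conf spark.task.maxFailures=8.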

Are there typically things we can do to avoid this?

That depends a lot on the nature of your job. However, as mentioned above, the usual suspect is that the created tasks are too heavy for an executor (e.g. in terms of required memory). Spark creates a number of partitions for each RDD based on several factors, such as the size of your cluster. If your cluster is quite small, for example, Spark might create partitions that are very big and cause problems for the executors. So you can try re-partitioning the RDDs in your code to enforce more, smaller partitions, which can be processed more easily; see the sketch below.
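Here is a minimal sketch of that re-partitioning idea; the input path is hypothetical and the partition count of 200 is purely illustrative, since a sensible value depends on your data size and cluster resources.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("repartition-example").getOrCreate()

    // Hypothetical input path; swap in your own dataset.
    val bigRdd = spark.sparkContext.textFile("hdfs:///path/to/input")
    println(s"partitions before: ${bigRdd.getNumPartitions}")

    // Full shuffle into more, smaller partitions so each task handles less data.
    val repartitioned = bigRdd.repartition(200)
    println(s"partitions after: ${repartitioned.getNumPartitions}")

    spark.stop()
  }
}
```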

Upvotes: 13
