preeze

Reputation: 1101

Apache Spark: what's the designed behavior if master fails

We are running our calculations in a standalone Spark cluster, version 1.0.2 (the previous major release). We do not have any HA or recovery logic configured. A piece of functionality on the driver side consumes incoming JMS messages and submits the corresponding jobs to Spark.

When we bring the single (and only) Spark master down (for testing), the driver program seems unable to figure out that the cluster is no longer usable. This results in two major problems:

  1. The driver tries to reconnect to the master endlessly, or at least we could not wait long enough for it to give up.
  2. Because of the previous point, submission of new jobs blocks (in org.apache.spark.scheduler.JobWaiter#awaitResult). I presume this is because the cluster is never reported as unreachable/down, so the submission logic simply waits until the cluster comes back. For us this means we run out of JMS listener threads very quickly, since they all end up blocked (see the sketch after this list for how we bound the wait as a stopgap).
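
As a driver-side workaround sketch (not from the Spark docs; countWithTimeout and the sample action are made up for illustration), the blocking action can be pushed onto a background thread and waited on with a timeout, so the listener thread is at least released after a bounded delay:

```scala
import java.util.concurrent.TimeoutException

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

import org.apache.spark.SparkContext

// Hypothetical mitigation: run the blocking Spark action on a background
// thread and bound the wait, so the JMS listener thread is freed even if
// the job never completes because the master is gone.
def countWithTimeout(sc: SparkContext, timeout: FiniteDuration): Option[Long] = {
  val job = Future {
    // any action (count, collect, ...) blocks in JobWaiter#awaitResult
    sc.parallelize(1 to 1000).count()
  }
  try {
    Some(Await.result(job, timeout))
  } catch {
    case _: TimeoutException =>
      // the background thread stays stuck, but the listener thread is free
      None
  }
}
```

Note that this only releases the listener thread; the background thread remains blocked until the SparkContext is stopped, so it is a mitigation rather than a fix.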

There are a couple of Akka failure-detection-related properties that can be configured on Spark, but:

  1. The official documentation strongly recommends against enabling Akka's built-in failure detection.
  2. I would really like to understand how this is supposed to work by default.

So, can anyone please explain the designed behavior when the single Spark master in a standalone deployment fails/stops/shuts down? I wasn't able to find any proper documentation about this.

Upvotes: 4

Views: 971

Answers (1)

Do Do

Reputation: 723

By default, Spark can handle worker failures, but not a failure of the master. If the master crashes, no new applications can be submitted. For that reason, two high-availability schemes are provided here: https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
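
For reference, a minimal driver-side sketch of the ZooKeeper-based standby-master scheme described on that page (master1, master2, and the app name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Driver side of a ZooKeeper-backed standby-master setup. The masters
// themselves must be started with
//   spark.deploy.recoveryMode=ZOOKEEPER
//   spark.deploy.zookeeper.url=<zk quorum>
// (typically via SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh).
val conf = new SparkConf()
  .setAppName("jms-driven-jobs")
  // list every master; the application registers with all of them and
  // fails over to whichever one is elected leader
  .setMaster("spark://master1:7077,master2:7077")
val sc = new SparkContext(conf)
```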

Hope this helps,

Le Quoc Do

Upvotes: 1
