Apache Spark: what's the designed behavior if master fails

Question

We are running our calculations in a standalone Spark cluster, ver 1.0.2 - the previous major release. We do not have any HA or recovery logic configured. A piece of functionality on the driver side consumes incoming JMS messages and submits respective jobs to spark.

When we bring the single & only Spark master down (for tests), it seems the driver program is unable to properly figure out that the cluster is no longer usable. This results in 2 major problems:

The driver tries to reconnect endlessly to the master, or at least we couldn't wait until it gives up.
Because of the previous point, submission of new jobs blocks (in org.apache.spark.scheduler.JobWaiter#awaitResult). I presume this is because the cluster is not reported unreacheable/down and the submission simply logic waits until the cluster comes back. For us this means that we run out of the JMS listener threads very fast since they all get blocked.

There are a couple of akka failure detection-related properties that you can configure on Spark, but:

The official documentation strongly doesn't recommend enabling akka's built-in failure detection.
I would really want to understand how this is supposed to work by default.

So, can anyone please explain what's the designed behavior if a single spark master in a standalone deployment mode fails/stops/shuts down. I wasn't able to find any proper doc on the internet about this.

Apache Spark: what's the designed behavior if master fails

Answers (1)

Related Questions

Apache Spark: what&#39;s the designed behavior if master fails

Answers (1)

Related Questions

Apache Spark: what's the designed behavior if master fails