Reputation: 5689
On a reasonably equipped 64-bit Fedora (home) server with 12 cores and 64 GB RAM, I have Spark 2.4 running in Standalone mode with the following configuration in ./spark-env.sh (where not shown are the items in that file that I have left commented out):
# =====================================================================
# Options for the daemons used in the standalone deploy mode
# =====================================================================
export SPARK_MASTER_HOST=dstorm
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080 # JupyterLab uses port 8888.
# ---------------------------------------------------------------------
export SPARK_WORKER_CORES=3 # 12 # To Set number of worker cores to use on this machine.
export SPARK_WORKER_MEMORY=4g # Total RAM workers have to give executors (e.g. 2g)
export SPARK_WORKER_WEBUI_PORT=8081 # Default: 8081
export SPARK_WORKER_INSTANCES=4 # 5 # Number of workers on this server.
# ---------------------------------------------------------------------
export SPARK_DAEMON_MEMORY=1g # To allocate to MASTER, WORKER and HISTORY daemons themselves (Def: 1g).
# =====================================================================
# =====================================================================
# Generic options for the daemons used in the standalone deploy mode
# =====================================================================
export SPARK_PID_DIR=${SPARK_HOME}/pid # PID file location.
# =====================================================================
After starting the Spark MASTER and WORKERS under this configuration, I then start Jupyter with just two Notebook tabs that point to this Spark Standalone Cluster.
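(For reference, the daemons are brought up with the stock sbin scripts; a sketch of the commands I use, assuming the standard Spark 2.4 layout:)
$SPARK_HOME/sbin/start-master.sh    # starts the MASTER on spark://dstorm:7077
$SPARK_HOME/sbin/start-slaves.sh    # starts SPARK_WORKER_INSTANCES worker daemons per host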
My issue is that just one Notebook tab's worth of cells -- by about the 5th or 6th cell -- consumes all the cores, leaving the second tab starved: it makes no progress because it waits for (but never gets) resources. I can confirm this in the Spark UI: a RUNNING status for the first Notebook tab's application, holding all cores, and a WAITING status for the second tab's application, with 0 cores. This is despite the fact that the first Notebook has completed its run (i.e. reached the bottom and finished its last cell).
By the way, this waiting is not restricted to Jupyter. If I next start Python/PySpark on the CLI and connect to the same cluster, it has to wait, too.
In all three cases I get a session like this:
spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()
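(For completeness, spark_conf is nothing special; a minimal sketch of what it looks like, with the app name made up and the master URL taken from SPARK_MASTER_HOST/PORT above:)
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Minimal conf pointing at the standalone master defined in spark-env.sh;
# the app name here is just an example.
spark_conf = (SparkConf()
              .setAppName("light-test")
              .setMaster("spark://dstorm:7077"))

spark_sesn = SparkSession.builder.config(conf=spark_conf).getOrCreate()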
Note that there is nothing heavy-duty going on in these notebook tabs or on the CLI. On the contrary, it's super light (just for testing).
Did I configure something wrong, or is my underlying concept of how work gets distributed incorrect? I thought there would be multiplexing here, not blocking. Perhaps it's a session-sharing issue (i.e. .getOrCreate())?
I've tried playing with the CORES + WORKER-INSTANCES combination (e.g. 12 and 5, respectively), but the same issue arises.
Hmmm. Well I will keep investigating (it's time for bed). =:)
Thank you in advance for your inputs and insights.
Upvotes: 0
Views: 1227
Reputation: 21
Have you started the shuffle service? If not, you need to do it this way:
$SPARK_HOME/sbin/start-shuffle-service.sh
Then you need to enable dynamic allocation and tell your SparkSession that the shuffle service is enabled.
To do so, declare it in your SparkConf():
spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true
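For instance, a minimal sketch of setting these in PySpark before calling getOrCreate() (the property names are the standard Spark ones; the idle-timeout value is just an example):
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Enable dynamic allocation against the external shuffle service
spark_conf = (SparkConf()
              .set("spark.dynamicAllocation.enabled", "true")
              .set("spark.shuffle.service.enabled", "true")
              .set("spark.dynamicAllocation.executorIdleTimeout", "60s"))  # example value

spark_sesn = SparkSession.builder.config(conf=spark_conf).getOrCreate()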
Look at: https://spark.apache.org/docs/latest/configuration.html#dynamic-allocation
Once an executor has been idle for the "spark.dynamicAllocation.executorIdleTimeout" amount of time, it will be removed and its cores freed. You can see this on the Standalone Master UI and the Spark app UI.
Another good link: https://dzone.com/articles/spark-dynamic-allocation
Upvotes: 2