St.Antario

Reputation: 27375

How does Spark prepare executors on Hadoop YARN?

I'm trying to understand the details of how Spark prepares its executors. To do this, I tried to debug org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked

Thread.currentThread().getContextClassLoader.getResource("")

It points to the following directory:

/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/

Looking at the directory, I found the following files:

default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__

The question is: who delivers the files to each executor and then just runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all auto-generated by YARN?

I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.

Upvotes: 2

Views: 579

Answers (1)

Jacek Laskowski

Reputation: 74619

Ouch... you're asking for quite a lot of detail on how Spark communicates with cluster managers when requesting resources. Let me give you some of it. Keep asking if you want more...


You are using Hadoop YARN as the cluster manager for Spark applications, so let's focus on this particular cluster manager only (there are others Spark supports, like Apache Mesos, Spark Standalone, DC/OS and soon Kubernetes, each with its own way of handling Spark deployments).

By default, when you submit a Spark application using spark-submit, the application (i.e. really the SparkContext it uses) requests three YARN containers: one for the application's ApplicationMaster, which knows how to talk to YARN and requests two more YARN containers for the two default Spark executors.
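To make that default explicit, here's a minimal Scala sketch (the app name is a placeholder, and spark.executor.instances is simply spelled out at its stock value of 2):

import org.apache.spark.sql.SparkSession

// A sketch only: with spark.executor.instances at its default of 2, YARN
// hosts three containers in total: 1 ApplicationMaster + 2 executors.
val spark = SparkSession.builder()
  .appName("three-containers-demo")          // placeholder name
  .master("yarn")
  .config("spark.executor.instances", "2")   // the default, made explicit
  .getOrCreate()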

You could review the YARN official documentation's Apache Hadoop YARN and Hadoop: Writing YARN Applications to dig deeper into the YARN internals.

At submission time, Spark's ApplicationMaster is handed over to YARN using the YARN "protocol", which requires that the request for the very first YARN container (container 0) carry a ContainerLaunchContext holding all the necessary launch details (see Client.createContainerLaunchContext).
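For illustration only (this is not Spark's actual code), a hedged sketch of building such a ContainerLaunchContext with the plain YARN API could look as follows; localResources and the AM command are placeholders:

import java.util.Collections

import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, LocalResource}

// A simplified sketch, not Client.createContainerLaunchContext itself.
// In Spark's case localResources would carry entries such as __spark_conf__
// and __spark_libs__ that YARN later localizes into the container directory.
val localResources = Collections.emptyMap[String, LocalResource]()

val amCommand = Collections.singletonList(
  "$JAVA_HOME/bin/java org.apache.spark.deploy.yarn.ApplicationMaster" +
    " 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr")

val ctx = ContainerLaunchContext.newInstance(
  localResources,                            // files to localize
  Collections.emptyMap[String, String](),    // environment, e.g. CLASSPATH
  amCommand,                                 // ends up in launch_container.sh
  null,                                      // service data (unused here)
  null,                                      // security tokens (unused here)
  null)                                      // application ACLs (unused here)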


who delivers the files to each executor

That's how YARN is told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for an ApplicationMaster container, YARN downloads the necessary files, which are exactly the ones you found in the container's working space.

That's internal to how any application works on YARN and has (almost) nothing to do with Spark.

The code responsible for the communication is in Spark's Client, especially Client.submitApplication.
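If you want to poke at that layer yourself, here's a bare-bones, hedged sketch of the same handshake done with the raw YARN client API (the application name is a placeholder; the launch context from above and all error handling are elided):

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Roughly what Client.submitApplication does under the covers.
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

val app = yarnClient.createApplication()
val appContext = app.getApplicationSubmissionContext
appContext.setApplicationName("sketch-app")  // placeholder
// appContext.setAMContainerSpec(ctx)        // the ContainerLaunchContext above

val appId = yarnClient.submitApplication(appContext)  // hand it all to YARN
println(s"Submitted $appId")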


and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.

Quoting the Mastering Apache Spark 2 gitbook:

CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN’s ExecutorRunnable is started.

ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
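As a condensed sketch of what ExecutorRunnable.prepareCommand assembles (every value below is a placeholder for illustration), the command that ends up in the executor container's launch script looks roughly like this:

// Placeholders; the real values come from the driver and YarnAllocator.
val driverUrl = "spark://CoarseGrainedScheduler@driver-host:12345"
val (executorId, hostname, appId) = ("1", "worker-host", "application_...")

val commands = Seq(
  "$JAVA_HOME/bin/java", "-server", "-Xmx1024m",
  "org.apache.spark.executor.CoarseGrainedExecutorBackend",
  "--driver-url", driverUrl,
  "--executor-id", executorId,
  "--hostname", hostname,
  "--cores", "1",
  "--app-id", appId,
  "1><LOG_DIR>/stdout", "2><LOG_DIR>/stderr")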


What are the scripts? Are they all auto-generated by YARN?

Kind of.

Some are prepared by Spark as part of the application submission, while others are YARN-specific.

Enable the DEBUG logging level in your Spark application and you'll see the file transfers.
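If you'd rather keep only that logger chatty, a small sketch (assuming Spark 2.x's default log4j 1.2 setup and client deploy mode) would be:

import org.apache.log4j.{Level, Logger}

// Raise just the YARN Client logger so the file-distribution messages show up.
Logger.getLogger("org.apache.spark.deploy.yarn.Client").setLevel(Level.DEBUG)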


You can find more information in the official Spark documentation's Running Spark on YARN and in my Mastering Apache Spark 2 gitbook.

Upvotes: 4
