Reputation: 27375
I'm trying to understand the details of how Spark prepares the executors. In order to do this I tried to debug org.apache.spark.executor.CoarseGrainedExecutorBackend and invoked
Thread.currentThread().getContextClassLoader.getResource("")
It points to the following directory:
/hadoop/yarn/local/usercache/_MY_USER_NAME_/appcache/application_1507907717252_15771/container_1507907717252_15771_01_000002/
Looking at the directory I found the following files:
default_container_executor_session.sh
default_container_executor.sh
launch_container.sh
__spark_conf__
__spark_libs__
The question is: who delivers the files to each executor and then just runs CoarseGrainedExecutorBackend with the appropriate classpath? What are the scripts? Are they all YARN-autogenerated?
I looked at org.apache.spark.deploy.SparkSubmit, but didn't find anything useful inside.
Upvotes: 2
Views: 579
Reputation: 74619
Ouch...you're asking for quite a lot of details on how Spark communicates with cluster managers while requesting resources. Let me give you some information. Keep asking if you want more...
You are using Hadoop YARN as the cluster manager for Spark applications. Let's focus on this particular cluster manager only (Spark supports others, such as Apache Mesos, Spark Standalone, DC/OS and, soon, Kubernetes, and each has its own way of dealing with Spark deployments).
By default, when you submit a Spark application using spark-submit, the Spark application (or, more precisely, the SparkContext it uses) requests three YARN containers. One container is for that Spark application's ApplicationMaster, which knows how to talk to YARN and requests the two other YARN containers for the two Spark executors.
You could review the official YARN documentation, in particular "Apache Hadoop YARN" and "Hadoop: Writing YARN Applications", to dig deeper into the YARN internals.
When you submit the Spark application, Spark's ApplicationMaster is submitted to YARN using the YARN "protocol", which requires that the request for the very first YARN container (container 0) use a ContainerLaunchContext that holds all the necessary launch details (see Client.createContainerLaunchContext).
who delivers the files to each executor
That's how YARN gets told how to launch the ApplicationMaster for the Spark application. While fulfilling the request for an ApplicationMaster container, YARN downloads the necessary files, which are the ones you found in the container's working space.
That's very internal to how any YARN application works on YARN and has (almost) nothing to do with Spark.
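To make the launch request more concrete, here is a minimal sketch of the kind of information a ContainerLaunchContext carries: files to localize, environment variables, and the command to run. The case classes below are hypothetical stand-ins written for illustration, not the real Hadoop records, and the staging path, classpath entries and command are assumptions modelled on the file names from the question.

```scala
// Hypothetical, simplified stand-ins for YARN's records. NOT the real Hadoop API.
final case class LocalResource(uri: String, kind: String) // kind: "FILE" or "ARCHIVE"

final case class LaunchContext(
    localResources: Map[String, LocalResource], // name in the container dir -> what to download
    environment: Map[String, String],           // e.g. CLASSPATH
    commands: Seq[String])                      // what the launch script ends up running

object LaunchContextSketch {
  // stagingDir is an assumed location where the submitting client uploaded its files.
  def forApplicationMaster(stagingDir: String): LaunchContext = {
    val resources = Map(
      "__spark_libs__" -> LocalResource(s"$stagingDir/__spark_libs__.zip", "ARCHIVE"),
      "__spark_conf__" -> LocalResource(s"$stagingDir/__spark_conf__.zip", "ARCHIVE"))

    // {{PWD}} and <CPS> are placeholders that YARN expands on the node (assumed here).
    val classpath = Seq("{{PWD}}", "{{PWD}}/__spark_conf__", "{{PWD}}/__spark_libs__/*")
    val env = Map("CLASSPATH" -> classpath.mkString("<CPS>"))

    val command = Seq(
      "{{JAVA_HOME}}/bin/java", "-server",
      "org.apache.spark.deploy.yarn.ApplicationMaster")

    LaunchContext(resources, env, Seq(command.mkString(" ")))
  }
}
```

The point of the sketch is only that the names you saw in the container directory (__spark_conf__, __spark_libs__) come from the resource map in the launch request, while the scripts around them are what the node uses to run the command.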
The code that's responsible for the communication is in Spark's Client, esp. Client.submitApplication.
and then just runs CoarseGrainedExecutorBackend with the appropriate classpath.
Quoting Mastering Apache Spark 2 gitbook:
CoarseGrainedExecutorBackend is a standalone application that is started in a resource container when (...) Spark on YARN's ExecutorRunnable is started.
ExecutorRunnable is started when Spark on YARN's YarnAllocator schedules it in allocated YARN resource containers.
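As a rough illustration of what ExecutorRunnable hands to YARN, the sketch below assembles a command line for CoarseGrainedExecutorBackend. It is a simplified model, not Spark's actual code: ExecutorSpec and command are hypothetical names, and the option names follow my recollection of the Spark 2.x sources, so check them against the version you are debugging.

```scala
object ExecutorCommandSketch {
  // Hypothetical holder for the values the allocator knows about an executor.
  final case class ExecutorSpec(
      driverUrl: String,
      executorId: String,
      hostname: String,
      cores: Int,
      appId: String,
      memoryMb: Int)

  // Builds the java command that the container launch script would execute.
  def command(spec: ExecutorSpec): Seq[String] = Seq(
    "{{JAVA_HOME}}/bin/java", "-server", s"-Xmx${spec.memoryMb}m",
    "org.apache.spark.executor.CoarseGrainedExecutorBackend",
    "--driver-url", spec.driverUrl,
    "--executor-id", spec.executorId,
    "--hostname", spec.hostname,
    "--cores", spec.cores.toString,
    "--app-id", spec.appId)

  def main(args: Array[String]): Unit = {
    // All values below are made up for the example.
    val spec = ExecutorSpec(
      driverUrl = "spark://CoarseGrainedScheduler@driver-host:35000",
      executorId = "1",
      hostname = "worker-host",
      cores = 1,
      appId = "application_1507907717252_15771",
      memoryMb = 1024)
    println(command(spec).mkString(" "))
  }
}
```

The classpath itself is not on this command line; in this model it would arrive through the CLASSPATH environment variable of the container's launch context.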
What are the scripts? Are they all YARN-autogenerated?
Kind of.
Some are prepared by Spark as part of a Spark application submission while others are YARN-specific.
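If you want to check this split on your own cluster, a small helper like the one below can list a container's working directory and label each entry. It is only a sketch: the origin labels encode my reading of the file names from the question (the __spark_ entries uploaded by Spark, the .sh scripts written by the YARN NodeManager), and ContainerFilesSketch, origin and describe are hypothetical names.

```scala
import java.io.File

object ContainerFilesSketch {
  // Labels are assumptions based on the file names listed in the question.
  def origin(name: String): String =
    if (name.startsWith("__spark_")) "prepared by Spark at submission, localized by YARN"
    else if (name == "launch_container.sh" || name.startsWith("default_container_executor"))
      "generated by the YARN NodeManager"
    else "unknown"

  // Returns (file name, assumed origin) pairs for the entries of dir, sorted by name.
  def describe(dir: File): Seq[(String, String)] = {
    // listFiles() returns null when dir is not a readable directory.
    val entries = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    entries.map(_.getName).sorted.map(n => n -> origin(n))
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the current working directory, which for a YARN container
    // is expected to be the container directory.
    val dir = new File(args.headOption.getOrElse(System.getProperty("user.dir")))
    describe(dir).foreach { case (name, from) => println(s"$name: $from") }
  }
}
```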
Enable DEBUG logging level in your Spark application and you'll see the file transfer.
You can find more information in the official Spark documentation's Running Spark on YARN and in my Mastering Apache Spark 2 gitbook.
Upvotes: 4