Reputation: 523
As per my understanding, Spark does not need to be installed on all the nodes in a YARN cluster. A Spark installation is only required on the node (usually a gateway node) from which the spark-submit script is fired.
As per the Spark programming guide:
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars.
How do the libraries containing Spark code (i.e. the Spark runtime jars available in ../spark-2.0.1-bin-hadoop2.6/jars) get distributed to the physical worker nodes (where the executors are launched) in a YARN cluster?
Thank You.
Upvotes: 0
Views: 268
Reputation: 523
I had posted this question in the Cloudera community and thought of sharing the answer here.
When running Spark on YARN, the Spark archive gets distributed to the worker nodes via the ContainerLocalizer (aka the distributed cache). Spark first uploads the files to HDFS, and the worker nodes then download the jars from HDFS when needed. The localizer has checks so that a jar is only downloaded when it has changed or has been removed from the worker; if it still exists locally, the cached copy is reused instead of being downloaded again.
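A minimal sketch of pre-staging that archive yourself so the localizer can reuse it across applications (the HDFS path /user/spark/share and the application class/jar are hypothetical placeholders):

    # Bundle the Spark runtime jars (store-only, no compression) and upload the bundle to HDFS.
    jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
    hdfs dfs -mkdir -p /user/spark/share
    hdfs dfs -put spark-libs.jar /user/spark/share/

    # Point spark.yarn.archive at the uploaded bundle; YARN's localizer pulls it
    # onto each worker node and caches it for reuse.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.archive=hdfs:///user/spark/share/spark-libs.jar \
      --class com.example.MyApp \
      my-app.jar

With this in place, the archive is uploaded once rather than re-shipped from the client on every submission.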
Upvotes: 1
Reputation: 652
First, the jars are uploaded to HDFS (the application's staging folder), and from there they are distributed to the local directories (typically under /tmp) of each NodeManager.
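For illustration, assuming default locations (the application id and the local-dir path below are typical examples, not guaranteed on every cluster), you can see both copies like this:

    # Staging copy on HDFS, under the submitting user's home directory.
    hdfs dfs -ls /user/$USER/.sparkStaging/application_1234567890123_0001

    # Localized copy on a worker node, under yarn.nodemanager.local-dirs
    # (often /tmp/hadoop-yarn/nm-local-dir by default).
    ls /tmp/hadoop-yarn/nm-local-dir/usercache/$USER/filecache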
Upvotes: 0
Reputation: 12991
You can place the jars on HDFS and then point spark.yarn.jars at that HDFS location. This makes the Spark jars available to all nodes.
Note, though, that if you need to distribute environment variables (e.g. via spark-env.sh), that file needs to be present on all the nodes.
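A rough sketch of that setup, assuming a writable HDFS directory (the path is hypothetical):

    # Upload the runtime jars once; the HDFS path is just an example.
    hdfs dfs -mkdir -p /user/spark/jars
    hdfs dfs -put $SPARK_HOME/jars/*.jar /user/spark/jars/

    # In spark-defaults.conf (globs are allowed for spark.yarn.jars):
    spark.yarn.jars  hdfs:///user/spark/jars/*.jar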
Upvotes: 0