Reputation: 2040
I submit jobs to a Spark cluster running on YARN using spark-submit, sometimes over a relatively slow connection. To avoid uploading the 156 MB spark-assembly file for each job, I set the configuration option spark.yarn.jar
to point to the file on HDFS. However, this does not avoid the upload; instead, it takes the assembly file from the HDFS Spark directory and copies it into the application's staging directory:
$:~/spark-1.4.0-bin-hadoop2.6$ bin/spark-submit --class MyClass --master yarn-cluster --conf spark.yarn.jar=hdfs://node-00b/user/spark/share/lib/spark-assembly.jar my.jar
[...]
15/07/06 21:25:43 INFO yarn.Client: Uploading resource hdfs://node-00b/user/spark/share/lib/spark-assembly.jar -> hdfs://nameservice1/user/XXX/.sparkStaging/application_1434986503384_0477/spark-assembly.jar
I was expecting the assembly file to be copied within HDFS, but it actually seems to be downloaded and uploaded again, which is quite counter-productive. Any hints on that?
Upvotes: 1
Views: 2049
Reputation: 695
Both HDFS paths have to be on the same filesystem for the upload to be skipped; see the relevant check in Spark's yarn.Client (the class that produces the "Uploading resource" log line above).
Any reason why you can't have the Spark assembly jar on the nameservice1 HDFS instead?
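A minimal sketch of that setup, assuming the assembly jar shipped with your spark-1.4.0-bin-hadoop2.6 distribution and that you can write to /user/spark/share/lib on nameservice1 (the jar name and paths are illustrative):

# Put the assembly that ships with the Spark distribution onto the cluster's
# default HDFS (nameservice1), i.e. the same filesystem that holds .sparkStaging:
hdfs dfs -mkdir -p hdfs://nameservice1/user/spark/share/lib
hdfs dfs -put lib/spark-assembly-1.4.0-hadoop2.6.0.jar hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar

# Point spark.yarn.jar at that copy; since source and destination are now on the
# same filesystem, the client should be able to reference it without re-uploading
# it on every submit:
bin/spark-submit --class MyClass --master yarn-cluster \
  --conf spark.yarn.jar=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar \
  my.jar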
Upvotes: 3