Reputation: 2040
I submit jobs to a Spark cluster running on YARN using spark-submit, sometimes over a relatively slow connection. To avoid uploading the 156 MB spark-assembly file for each job, I set the configuration option spark.yarn.jar
to point to the file on HDFS. However, this does not avoid the upload; instead, it takes the assembly file from the HDFS Spark directory and copies it into the application's staging directory:
$:~/spark-1.4.0-bin-hadoop2.6$ bin/spark-submit --class MyClass --master yarn-cluster --conf spark.yarn.jar=hdfs://node-00b/user/spark/share/lib/spark-assembly.jar my.jar
[...]
15/07/06 21:25:43 INFO yarn.Client: Uploading resource hdfs://node-00b/user/spark/share/lib/spark-assembly.jar -> hdfs://nameservice1/user/XXX/.sparkStaging/application_1434986503384_0477/spark-assembly.jar
I was expecting the assembly file to be copied within HDFS, but it actually seems to be downloaded and uploaded again, which is quite counter-productive. Any hints on that?
Upvotes: 1
Views: 2049
Reputation: 695
Both HDFS paths have to be on the same filesystem for the upload to be skipped; see the relevant check in Spark's yarn.Client (the class that produces the "Uploading resource" log line above).
Any reason why you can't have the Spark assembly jar on the nameservice1 HDFS instead?
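A minimal sketch of that setup, assuming the assembly jar shipped with your spark-1.4.0-bin-hadoop2.6 distribution and that you can write to /user/spark/share/lib on nameservice1 (the jar name and paths are illustrative):

# Put the assembly that ships with the Spark distribution onto the cluster's
# default HDFS (nameservice1), i.e. the same filesystem that holds .sparkStaging:
hdfs dfs -mkdir -p hdfs://nameservice1/user/spark/share/lib
hdfs dfs -put lib/spark-assembly-1.4.0-hadoop2.6.0.jar hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar

# Point spark.yarn.jar at that copy; since source and destination are now on the
# same filesystem, the client should be able to reference it without re-uploading
# it on every submit:
bin/spark-submit --class MyClass --master yarn-cluster \
  --conf spark.yarn.jar=hdfs://nameservice1/user/spark/share/lib/spark-assembly.jar \
  my.jar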
Upvotes: 3