Reputation: 12364
I have an initialisation script that downloads a .jar from our local artefact repository and places it into /usr/local/bin on each node of the cluster. I can run it using
gcloud dataproc jobs submit spark --cluster=my_cluster \
--region=us-central1 --jar=file:///usr/local/bin/myjar.jar -- arg1 arg2
However I'd prefer it if my end users did not have to know the location of the jar.
Where can I put the .jar so that the location of it does not have to be specified?
Upvotes: 3
Views: 4346
Reputation: 10707
For Spark jobs, you should be able to just place your jarfiles in /usr/lib/spark/jars on all nodes so that they are automatically available on the classpath.
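For instance, the existing init action could drop the jar straight onto the Spark classpath instead of /usr/local/bin. A minimal sketch, assuming a placeholder artifact URL and jar name:

#!/bin/bash
# Hypothetical init action: fetch the jar onto the Spark classpath so jobs
# can use its classes without knowing a path (URL and jar name are placeholders).
curl -fsSL -o /usr/lib/spark/jars/myjar.jar \
    https://artifacts.example.com/releases/myjar.jar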
For more general coverage, you could add your jars to /usr/lib/hadoop/lib instead; the Hadoop lib directory is also automatically included in Spark jobs on Dataproc, and is where libraries such as the GCS connector jarfile reside. You can see the Hadoop lib directory being included via the SPARK_DIST_CLASSPATH environment variable configured in /etc/spark/conf/spark-env.sh.
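For example, you can confirm this on any cluster node with:

grep SPARK_DIST_CLASSPATH /etc/spark/conf/spark-env.sh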
If the desired behavior is still to use the --jar flag to specify a "main jar" (rather than --jars to specify library jars that just provide classes), unfortunately there's currently no notion of a "working directory" on the cluster that would allow specifying relative (instead of absolute) paths to the "main jar". However, there are two approaches with similar behavior:

1. Submit the jarfile from the user's workstation instead of staging it on the cluster; gcloud uploads it into a job-specific GCS path automatically. Note that this requires running gcloud dataproc jobs delete later on to clean up the GCS space used by those jarfiles.

2. (Preferred) Use the --class argument instead of --jar to specify what job to run, after doing the steps above to make the jar available on the Spark classpath already. While invoking a classname is a bit more verbose, it still achieves the goal of hiding the jarfile location from the user.

For example, the classes used for the spark-shell implementation are already on the classpath, so if you wanted to run a Scala file as if you were running it through spark-shell, you could run:
gcloud dataproc jobs submit spark --cluster my-cluster \
--class org.apache.spark.repl.Main \
-- -i myjob.scala
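Submitting your own jar works the same way once it is on the classpath (e.g. placed in /usr/lib/spark/jars by the init action); the class name below is just a placeholder for whatever main class your jar provides:

gcloud dataproc jobs submit spark --cluster=my_cluster \
    --region=us-central1 --class=com.example.MyMain -- arg1 arg2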
Upvotes: 2