jamiet

Reputation: 12364

Where should I put jars on a dataproc cluster so they can be used by gcloud dataproc jobs submit spark?

I have an initialisation script that downloads a .jar from our local artefact repository and places it into /usr/local/bin on each node on the cluster. I can run it using

gcloud dataproc jobs submit spark --cluster=my_cluster \
      --region=us-central1 --jar=file:///usr/local/bin/myjar.jar -- arg1 arg2

However I'd prefer it if my end users did not have to know the location of the jar.

Where can I put the .jar so that the location of it does not have to be specified?

Upvotes: 3

Views: 4346

Answers (1)

Dennis Huo

Reputation: 10707

For Spark jobs, you should be able to simply place your jarfiles in /usr/lib/spark/jars on all nodes so that they are automatically available on the classpath.
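For instance, the asker's existing initialization action could be adapted along these lines to stage the jar at cluster-creation time; this is only a rough sketch, and the artifact URL and jar name are placeholders, not anything from the original post:

    #!/bin/bash
    # Hypothetical init action: download an application jar and place it on the
    # Spark classpath. The source URL below is a placeholder for your own
    # artifact repository.
    set -euo pipefail

    JAR_URL="https://artifacts.example.com/releases/myjar.jar"   # placeholder
    DEST_DIR="/usr/lib/spark/jars"

    # Download the jar to the directory Spark picks up automatically.
    curl -fsSL "${JAR_URL}" -o "${DEST_DIR}/myjar.jar"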

For more general coverage, you could add your jars to /usr/lib/hadoop/lib instead; the hadoop lib directory is also automatically included in Spark jobs on Dataproc, and is where libraries such as the GCS connector jarfile reside. You can see the hadoop lib directory being included via the SPARK_DIST_CLASSPATH environment variable configured in /etc/spark/conf/spark-env.sh.
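If you want to verify this on a cluster node, a quick check along these lines should show the hadoop lib directory being pulled in (the exact contents of spark-env.sh vary by Dataproc image version):

    # Show how the hadoop lib directory is added to the Spark classpath.
    grep SPARK_DIST_CLASSPATH /etc/spark/conf/spark-env.sh

    # Or resolve the expanded value in a shell that sources the Spark environment:
    source /etc/spark/conf/spark-env.sh && echo "${SPARK_DIST_CLASSPATH}"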

If the desired behavior is still to use the --jar flag to specify a "main jar" (as opposed to --jars, which specifies library jars that just provide classes), unfortunately there's currently no notion of a "working directory" on the cluster that would let you specify relative (instead of absolute) paths to the "main jar". However, there are two approaches with similar behavior:

  1. Make the jarfile local to the user's workspace from which jobs are submitted - gcloud will then upload the jarfile into GCS at job-submission time and point the job at it in a job-specific directory. Note that this causes a duplicate upload of the jarfile into GCS each time the job runs, since it always stages into a unique job directory; you'd have to run gcloud dataproc jobs delete later on to clean up the GCS space used by those jarfiles.
  2. (Preferred approach): Use --class instead of the --jar argument to specify which job to run, after doing the steps above to make the jar available on the Spark classpath already. While invoking a classname is a bit more verbose, it still achieves the goal of hiding jarfile-location details from the user (see the sketch just after this list).
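As an illustration of the preferred approach, a submission might look like the following, once the jar has been staged onto the classpath by the init action; the class name com.example.MyJobMain is purely a placeholder for whatever main class lives inside your pre-staged jar:

    gcloud dataproc jobs submit spark --cluster=my_cluster \
        --region=us-central1 \
        --class=com.example.MyJobMain -- arg1 arg2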

For example, the classes used for the "spark-shell" implementation are already on the classpath, so if you wanted to run a Scala file as if you were running it through spark-shell, you could run:

gcloud dataproc jobs submit spark --cluster my-cluster \
    --class org.apache.spark.repl.Main \
    -- -i myjob.scala

Upvotes: 2
