Reputation: 12264
Just starting to get familiar with GCP Dataproc. I've noticed that when I use `gcloud dataproc jobs submit pyspark`, jobs are submitted with `spark.submit.deployMode=client`. Is `spark.submit.deployMode=cluster` an option for us?
Upvotes: 3
Views: 4185
Reputation: 151
Found here hidden away in a how-to from Google:
By default, Dataproc runs Spark jobs in client mode, and streams the driver output for viewing as explained below. However, if the user creates the Dataproc cluster by setting cluster properties to --properties spark:spark.submit.deployMode=cluster, or submits the job in cluster mode by setting job properties to --properties spark.submit.deployMode=cluster, driver output is listed in YARN userlogs, which can be accessed in Logging.
However, it's not entirely clear what the difference is between creating the cluster with cluster mode as its default and submitting an individual job in cluster mode. I'd have to run an experiment, but my guess is: if the cluster runs in client mode, the executor logs get streamed to the console output (gathered with the driver logs); if the cluster defaults to cluster mode, then only the driver logs are viewable in the job console; and if you submit a job in cluster mode, then nothing is sent to the job console and you must fetch the logs from wherever Dataproc is dumping the YARN container logs (again, something that must be configured).
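To make the cluster-level variant concrete, here is a rough sketch of setting cluster mode as the default at cluster creation time. The cluster name and region are placeholders; note the `spark:` prefix that Dataproc requires on cluster-level Spark properties:

```shell
# Sketch: make cluster mode the default deploy mode for all Spark jobs
# on this cluster ("my-cluster" and the region are placeholder values).
# Cluster-level Spark properties take a "spark:" prefix.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --properties=spark:spark.submit.deployMode=cluster
```

With this set, every Spark job submitted to the cluster runs in cluster mode unless a job-level property overrides it.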
Upvotes: 1
Reputation: 1383
Yes, you can, by specifying `--properties spark.submit.deployMode=cluster`. Just note that driver output will be in YARN userlogs (you can access them in Stackdriver Logging from the Console). We run in client mode by default to stream driver output to you.
Upvotes: 8