creating dataproc cluster with multiple jars

Question

I am trying to create a Dataproc cluster that connects to Pub/Sub. I need to pass multiple jars at cluster creation through the spark.jars property:

gcloud dataproc clusters create cluster-2c76 --region us-central1 --zone us-central1-f --master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 500 \
--image-version 1.4-debian10 \
--properties spark:spark.jars=gs://bucket/jars/spark-streaming-pubsub_2.11-2.4.0.jar,gs://bucket/jars/google-oauth-client-1.31.0.jar,gs://bucket/jars/google-cloud-datastore-2.2.0.jar,gs://bucket/jars/pubsublite-spark-sql-streaming-0.2.0.jar spark:spark.driver.memory=3000m \
--initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh \
--metadata spark-bigquery-connector-version=0.21.0 \
--scopes=pubsub,datastore

This fails with the following error:

ERROR: (gcloud.dataproc.clusters.create) argument --properties: Bad syntax for dict arg: [gs://gregalr/jars/spark-streaming-pubsub_2.11-2.3.4.jar]. Please see `gcloud topic flags-file` or `gcloud topic escaping` for information on providing list or dictionary flag values with special characters.

This looked promising, but it fails.

If there is a better way to connect Dataproc to Pub/Sub, please share.

Dennis Huo · Accepted Answer

The answer you linked, "How can I include additional jars when starting a Google DataProc cluster to use with Jupyter notebooks?", is the correct way to do it.

If you also post the command you tried with the escaping syntax and the resulting error message, others can more easily verify what went wrong. It looks like you're specifying a second Spark property, spark:spark.driver.memory=3000m, in addition to your list of jars, and you tried to separate it from the jars with just a space, which isn't allowed.

Per the linked answer, you need to use the newly assigned separator character (here #, declared by the ^#^ prefix) to separate the second Spark property:

--properties=^#^spark:spark.jars.packages=artifact1,artifact2,artifact3#spark:spark.driver.memory=3000m
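Applied to the jars from the question, the value could be assembled roughly like this. This is a minimal sketch, not a tested cluster configuration: it assumes the same delimiter trick works with spark.jars as with spark.jars.packages, and the variable names JARS and PROPERTIES are only illustrative.

```shell
# Jar paths copied from the question; adjust to your own bucket.
JARS="gs://bucket/jars/spark-streaming-pubsub_2.11-2.4.0.jar"
JARS="${JARS},gs://bucket/jars/google-oauth-client-1.31.0.jar"
JARS="${JARS},gs://bucket/jars/google-cloud-datastore-2.2.0.jar"
JARS="${JARS},gs://bucket/jars/pubsublite-spark-sql-streaming-0.2.0.jar"

# The ^#^ prefix tells gcloud to split the properties on '#' instead of ',',
# so the commas inside the jar list stay part of the spark.jars value.
PROPERTIES="^#^spark:spark.jars=${JARS}#spark:spark.driver.memory=3000m"

# Print the value to check it before creating the cluster.
echo "${PROPERTIES}"

# Assumed usage, with the remaining flags from the question left unchanged:
#   gcloud dataproc clusters create cluster-2c76 --region us-central1 \
#     --properties="${PROPERTIES}"
```

Quoting the whole value keeps the shell from treating # or the spaces as anything special; see `gcloud topic escaping` for the delimiter syntax.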
