nbk

Reputation: 543

Google Dataproc Pyspark Properties

I'm trying to submit a pyspark job to a Google Dataproc cluster, and I want to specify the properties for the pyspark configuration at the command line. The documentation says that I can specify those properties with the --properties flag. The command I'm trying to run looks something like this:

gcloud dataproc jobs submit pyspark simpleNB.py --cluster=elinorcluster  —-properties=executor-memory=10G --properties=driver-memory=46G --properties=num-executors=20 -- -i X_small_train.txt -l y_small_train.txt -u X_small_test.txt -v y_small_test.txt

I have seriously tried every combination I can think of for the properties flag:

gcloud dataproc jobs submit pyspark simpleNB.py --cluster=elinorcluster  —-properties executor-memory=10G, driver-memory=46G,properties=num-executors=20 -- -i X_small_train.txt -l y_small_train.txt -u X_small_test.txt -v y_small_test.txt

etc., but I can't seem to get it to work. Keeps giving me this error:

ERROR: (gcloud.dataproc.jobs.submit.pyspark) unrecognized arguments: —-properties=executor-memory=10G
Usage: gcloud dataproc jobs submit pyspark PY_FILE --cluster=CLUSTER [optional flags] [-- JOB_ARGS ...]
  optional flags may be  --archives | --driver-log-levels | --files | --help |
                     --jars | --labels | --properties | --py-files | -h

Does anybody know how to make this work? It says that it needs a list of key value pairs, but what is the format of the list?

Upvotes: 2

Views: 5939

Answers (3)

cloud-sdk-oncall

Reputation: 33

You should specify all the properties in a single flag, like so:

--properties=executor-memory=10G,driver-memory=46G,num-executors=20

You can also use ':' instead of '=' as the key/value separator, to make it less ambiguous with the '=' of the flag itself, e.g.:

 --properties=executor-memory:10G,driver-memory:46G,num-executors:20
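Combined with the command from the question, the full invocation would look something like this (a sketch reusing the cluster, script, and file names from the question; the keys are written as the standard Spark configuration names, e.g. spark.executor.memory rather than executor-memory, since Dataproc passes them through to Spark):

```shell
gcloud dataproc jobs submit pyspark simpleNB.py \
  --cluster=elinorcluster \
  --properties=spark.executor.memory=10G,spark.driver.memory=46G,spark.executor.instances=20 \
  -- -i X_small_train.txt -l y_small_train.txt -u X_small_test.txt -v y_small_test.txt
```

Note that the flag starts with two plain ASCII hyphens; the "unrecognized arguments" error in the question comes from an em-dash being pasted in their place.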

Upvotes: 2

Watacroft

Reputation: 352

The Pyspark property names have to be the ones on this list. And the correct command syntax for properties is:

gcloud dataproc jobs submit pyspark PY_FILE --cluster=CLUSTER --properties=[PROPERTY-A=VALUE-A,PROPERTY-B=VALUE-B,…]

Upvotes: 0

nbk

Reputation: 543

The format of the list is a single string of comma-separated k/v pairs, in quotes:

gcloud dataproc jobs submit pyspark simpleNB.py --cluster=elinorcluster \
  --properties='spark.executor.memory=10G,spark.driver.memory=46G,spark.executor.instances=20' \
  -- -i X_small_train.txt -l y_small_train.txt \
  -u X_small_test.txt -v y_small_test.txt

The property names also need to be legitimate Spark configuration keys: driver-memory=46G is not one, while spark.driver.memory=46G is (likewise, the executor count is spark.executor.instances).

Upvotes: -1
