肉肉Linda

Reputation: 588

How to submit a PySpark job with dependencies on a Google Dataproc cluster

I am using a Google Dataproc cluster to run a Spark job; the script is written in Python.

When there is only one script (test.py, for example), I can submit the job with the following command:

gcloud dataproc jobs submit pyspark --cluster analyse ./test.py

But now test.py imports modules from other scripts I wrote myself. How can I specify these dependencies in the command?

Upvotes: 3

Views: 7277

Answers (2)

Galuoises

Reputation: 3293

If you have a structure like

maindir
├── lib
│   └── lib.py
└── run
    └── script.py

You could include additional files with the --files flag or the --py-files flag, for example (run from maindir):

gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --files lib/lib.py run/script.py

and you can then import it in script.py as

from lib import something
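
If there are several helper modules, --files also accepts a comma-separated list (extra.py below is just a placeholder for a second module):

gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --files lib/lib.py,lib/extra.py run/script.py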

However, I am not aware of a method to avoid the tedious process of adding the file list manually. Please check Submit a python project to dataproc job for a more detailed explanation.
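
For completeness, a minimal sketch of shipping the whole lib directory as a single archive with --py-files instead (this assumes lib contains an __init__.py so it can be imported as a package; libs.zip is just a placeholder name):

# from inside maindir: bundle the package once, then submit
cd maindir
zip -r libs.zip lib
gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --py-files=libs.zip run/script.py

and in script.py import it as

from lib.lib import something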

Upvotes: 1

tix

Reputation: 2158

You could use the --py-files option mentioned here.
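
For example, reusing the command from the question (mymodule.py is a placeholder for one of the local modules that test.py imports):

gcloud dataproc jobs submit pyspark --cluster analyse --py-files mymodule.py ./test.py

Multiple files can be listed comma separated, e.g. --py-files mymodule.py,othermodule.py, and a .zip or .egg of a whole package is accepted as well.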

Upvotes: 3
