Reputation: 588
I am using a Google Dataproc cluster to run a Spark job; the script is written in Python.
When there is only one script (test.py, for example), I can submit the job with the following command:
gcloud dataproc jobs submit pyspark --cluster analyse ./test.py
But now test.py imports modules from other scripts I wrote myself. How can I specify these dependencies in the command?
Upvotes: 3
Views: 7277
Reputation: 3293
If you have a structure like

- maindir
  - lib
    - lib.py
  - run
    - script.py
you can include additional files with the --files flag or, for Python modules, the --py-files flag (paths below are relative to maindir):
gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --files lib/lib.py run/script.py
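For Python modules the --py-files variant is usually the better fit, since it also accepts .zip and .egg archives. As a sketch under the same assumptions (clustername and regionname are placeholders), the equivalent call would be:

gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --py-files lib/lib.py run/script.py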
and in script.py you can import it as
from lib import something
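For illustration, here is a minimal pair of files under the layout above (the function name something and its body are hypothetical, standing in for whatever lib.py actually defines):

# lib/lib.py
def something(x):
    # hypothetical helper used by the main job
    return x * 2

# run/script.py
from pyspark.sql import SparkSession
from lib import something  # lib.py is shipped into the job's working directory

spark = SparkSession.builder.appName("demo").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.map(something).collect())  # prints [2, 4, 6]
spark.stop()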
However, I am not aware of a method that avoids the tedious process of adding the file list manually. See Submit a python project to dataproc job for a more detailed explanation.
Upvotes: 1