Reputation: 183
Below is my Dataproc job submit command. I pass the project artifacts as a zip file via the --files flag:
gcloud dataproc jobs submit pyspark --cluster=test_cluster --region us-central1 gs://test-gcp-project-123/main.py --files=gs://test-gcp-project-123/app_code_v2.zip
Following are the contents of "app_code_v2.zip".
I'm able to add "app_code_v2.zip" to the path with the snippet below and access the Python modules, but how do I access the YAML files present in the zip package? Those YAML files contain the configs. Should I explicitly unzip the archive and copy it to the working directory of the master node, or is there a better way to handle this?
import os
import sys

if os.path.exists('app_code_v2.zip'):
    sys.path.insert(0, 'app_code_v2.zip')
Upvotes: 3
Views: 1247
Reputation: 26528
You might want to 1) extract the YAML files first and add them explicitly to the flag, like --files=<zip>,<yaml>,..., or 2) use --archives=<zip>, which is automatically extracted into the executor work dirs. Either way, you can get the actual path of the file with SparkFiles.get(filename). See more info on the flags in this doc.
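A minimal sketch of option 1, assuming the YAML is shipped alongside the zip (e.g. --files=gs://test-gcp-project-123/app_code_v2.zip,gs://test-gcp-project-123/config.yml) and that PyYAML is installed on the cluster; "config.yml" is a placeholder name, not something from your archive:

import yaml
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("config-demo").getOrCreate()

def load_config(_):
    # SparkFiles.get resolves the local path of a file distributed with --files
    with open(SparkFiles.get("config.yml")) as f:
        return yaml.safe_load(f)

# Run the read inside a task, since the file is shipped to the executors
config = spark.sparkContext.parallelize([0], 1).map(load_config).first()
print(config)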
Note that files passed through --files and --archives are available to Spark executors only. This behavior is consistent with spark-submit. If you need the files to be accessible by the Spark driver, consider using an init action to put the files somewhere in the local filesystem explicitly.
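As an alternative to the init action, if the zip is already staged in the driver's working directory (as the os.path.exists() check in the question suggests), you can read a config straight out of the archive with Python's zipfile module. This is only a sketch; "configs/config.yml" is a placeholder path inside the zip:

import zipfile
import yaml

# Open the staged archive and load the YAML without unpacking everything
with zipfile.ZipFile("app_code_v2.zip") as zf:
    with zf.open("configs/config.yml") as f:  # hypothetical path inside the zip
        config = yaml.safe_load(f)

print(config)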
Upvotes: 0