Bajwa

Reputation: 101

Dataproc doesn't import Python module stored in Google Cloud Storage bucket

I have following structure on Google Cloud Storage (GCS) bucket :

gs://my_bucket/py_scripts/
    wrapper.py
    mymodule.py
    _init__.py

I am running wrapper.py through Dataproc as a PySpark job. It imports mymodule with import mymodule at the start, but the job fails with an error saying there is no module named mymodule, even though both files are at the same path. The same code works fine in a Unix environment.

Note that _init__.py is empty. I also tested from mymodule import myfunc, but it returns the same error.

Upvotes: 4

Views: 2182

Answers (1)

Animesh

Reputation: 74

Can you provide your PySpark job submit command? I suspect you are not passing the --py-files flag to supply the additional Python files to the job. See https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/pyspark for reference. Dataproc will not assume that other files in the same GCS bucket are inputs to the job; they must be listed explicitly, for example as shown below.
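A minimal sketch of such a submit command, assuming a cluster named my-cluster in us-central1 (both placeholders), which ships mymodule.py alongside the main script:

    # Submit wrapper.py as the main script and distribute mymodule.py with --py-files
    gcloud dataproc jobs submit pyspark gs://my_bucket/py_scripts/wrapper.py \
        --cluster=my-cluster \
        --region=us-central1 \
        --py-files=gs://my_bucket/py_scripts/mymodule.py

With --py-files, the extra file is distributed to the Spark workers and placed on the Python path, so import mymodule should resolve at run time.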

Upvotes: 2
