John

Reputation: 1167

ModuleNotFoundError on running PySpark in Dataproc

I'm running a PySpark job on GCP (Dataproc 1.4) in which I'm trying to read from Google Cloud Storage, and I'm getting the following error:

    from google.cloud import storage
  File "/opt/conda/default/lib/python3.6/site-packages/google/cloud/storage/__init__.py", line 38, in <module>
    from google.cloud.storage.blob import Blob
  File "/opt/conda/default/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 54, in <module>
    from google.cloud.iam import Policy
ModuleNotFoundError: No module named 'google.cloud.iam'
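For context, the read I'm attempting is just the standard client-library pattern, roughly like this (bucket and object names here are placeholders):

from google.cloud import storage  # this import is what raises the error above

client = storage.Client()
bucket = client.bucket("my-bucket")
text = bucket.blob("example.txt").download_as_string().decode("utf-8")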

I thought that all google.cloud dependencies would be loaded by default in the environment. I also tried adding 'PIP_PACKAGES=google-cloud-iam==0.1.0' when I created the cluster, but no luck.

EDIT: A more general question - pip install is not recognizing Python packages with hyphens (e.g. 'PIP_PACKAGES=google-cloud-storage'). What escape pattern should I use to get this to work?

Upvotes: 1

Views: 1012

Answers (1)

tix

Reputation: 2158

It should not be necessary to use the Cloud Storage client APIs to read from GCS. Instead, use the GCS connector provided by Dataproc [1] (it's already on the classpath, so no further action is necessary).

It is implemented as a Hadoop file system, so any Spark read or write API will accept a URI of the form gs://my-bucket/.... For instance:

sc.textFile("gs://my-bucket/example.txt")

Globbing should also work.
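For example, here is a minimal sketch using the DataFrame API, with placeholder bucket and paths (assuming a standard Dataproc image, where the connector is preinstalled):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-example").getOrCreate()

# Read every CSV under a prefix with a glob; the connector resolves the gs:// URIs.
df = spark.read.csv("gs://my-bucket/input/*.csv", header=True)

# Write results straight back to GCS through the same connector.
df.write.parquet("gs://my-bucket/output/")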

[1] https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage

Upvotes: 1
