mikee thanh

Reputation: 1

ModuleNotFoundError: No module named 'minio' when submitting a PySpark job on Google Cloud Dataproc

I’m facing an issue when trying to submit a PySpark job to Google Cloud Dataproc. The goal is to run a script on the Dataproc cluster that uses the minio module. However, I keep encountering the following error:

    ModuleNotFoundError: No module named 'minio'

This is the code I submit on Dataproc:

    (screenshot of the submit command; not reproduced here)

My Dataproc cluster consists of 1 master node and 2 worker nodes.

How can I correctly install and use the minio module in a PySpark job on Google Cloud Dataproc?

Upvotes: 0

Views: 113

Answers (1)

yagmurkoksal

Reputation: 56

Please share more information about your cluster and your submit command (is it serverless or a more standard cluster, etc.?). The potential reasons that come to mind from what you've shared so far are:

  • While preparing the tar.gz file, the minio path may not be accessible. You could try making minio a separate package and passing it with the flag you shared (see the first sketch after this list). Also, as far as I remember, the --py-files flag does not work on serverless Dataproc clusters.
  • Apart from this, check whether the service account used by the Dataproc cluster has access to the bucket where the minio file is located (it most likely does, but it is worth double-checking).
  • Finally, you can try initialization actions as an alternative (a sketch follows after this list); please check these:
    1. https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
    2. How to install python packages in a Google Dataproc cluster
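
For the packaging route, here is a minimal sketch, assuming a standard (non-serverless) cluster; the bucket, cluster, region, and script names are placeholders. Since minio is pure Python, a plain zip of the installed package is usually enough:

    # Install minio into a local folder and zip it up
    pip install minio -t deps/
    cd deps && zip -r ../deps.zip . && cd ..

    # Upload the archive and ship it to the executors with --py-files
    gsutil cp deps.zip gs://your-bucket/deps.zip
    gcloud dataproc jobs submit pyspark your_script.py \
        --cluster=your-cluster \
        --region=your-region \
        --py-files=gs://your-bucket/deps.zip

Dependencies shipped this way are only visible to that one job, not to other jobs on the cluster.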

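For the initialization-actions route, a sketch using Google's published pip-install action, assuming you can recreate the cluster (region, cluster name, and package list are placeholders):

    # Create the cluster with the pip-install initialization action,
    # which runs pip on every node at cluster creation time
    REGION=your-region
    gcloud dataproc clusters create your-cluster \
        --region=${REGION} \
        --initialization-actions=gs://goog-dataproc-initialization-actions-${REGION}/python/pip-install.sh \
        --metadata='PIP_PACKAGES=minio'

With the package installed on every node this way, the job can simply import minio without any --py-files flag.
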
I hope this solves your problem!

Upvotes: 0
