Initialize virtual environment from requirements.txt while submitting PySpark job to Google Dataproc

Question

I wanted to submit a PySpark job in a Dataproc cluster running Python 3 by default. I wanted to initialize the environment with the virtual env I have.

I tried two ways, One is to zip the entire venv as and upload it as archive and submit it to the cluster. But my job was not able to find the dependencies. e.g

gcloud dataproc jobs submit pyspark --project=** --region=** --cluster=** \
  --archives gs://**/venv.zip#venv \
  --properties spark.pyspark.driver.python=venv/bin/python \
  gs://****.main.py

Second method was that I tried to tell spark to create a virtual env for me and install the requirements from the requirements file provided to me as mentioned in the link

Pyspark with Virtual env

But both the approach failed. Can anyone help? Plus I don't want to go the post initialization script way of Dataproc. I would really want to avoid that.

Initialize virtual environment from requirements.txt while submitting PySpark job to Google Dataproc

Answers (1)

Related Questions