Reputation: 117
When I'm running on a vanilla Spark cluster and want to run a PySpark script against a specific virtualenv, I can create the virtual environment, install packages as needed, and then zip the environment into a file, say venv.zip.
Then, at runtime, I can execute
spark-submit --archives venv.zip#VENV --master yarn script.py
and then, so long as I run
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python"
inside of script.py, the code will run against the virtual environment, and Spark will handle distributing the virtual environment to all of my worker nodes.
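For reference, a minimal sketch of what script.py looks like in that setup (the VENV alias matches the --archives fragment above; the app name and the toy job are just illustrative). The environment variable needs to be set before the SparkContext is created, since that is when Spark picks which Python to launch on the workers:

# minimal sketch of script.py, assuming the archive was submitted as --archives venv.zip#VENV
import os

# point the workers at the shipped interpreter before any SparkContext exists
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venv-test").getOrCreate()

# trivial job to confirm the executors can start their Python workers from the venv
print(spark.sparkContext.parallelize(range(4)).map(lambda x: x * 2).collect())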
When I do this on Dataproc, first, the Hadoop-style # aliasing (venv.zip#VENV) doesn't work, and second, running
gcloud dataproc jobs submit pyspark script.py --archives venv.zip --cluster <CLUSTER_NAME>
with os.environ["PYSPARK_PYTHON"] = "venv.zip/bin/python"
will produce:
Error from python worker:
venv/bin/python: 1: venv.zip/bin/python: Syntax error: word unexpected (expecting ")")
It's clearly finding my Python executable and trying to run it, but there appears to be some sort of parsing error. What gives? Is there any way to tell Dataproc which Python executable to use, the way you can on a vanilla Spark cluster?
Upvotes: 1
Views: 1257
Reputation: 117
Turns out I had built the virtualenv on a different OS than the one the cluster nodes run, and was boneheaded enough not to notice I was doing so. The workers' shell was falling back to interpreting the incompatible Python binary as a script, which is what produced the "Syntax error: word unexpected" message.
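For anyone hitting the same wall: a quick way to catch this is to compare the platform you built the venv on with what the executors actually run. A rough sketch of such a check (run it with the cluster's system Python, without pointing PYSPARK_PYTHON at the broken venv; the app name is just illustrative):

import platform
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("platform-check").getOrCreate()

# what the driver sees
print("driver:  ", platform.platform(), sys.version.split()[0])

# what an executor sees; the venv should be built on a matching OS/arch (or on the cluster itself)
executor = spark.sparkContext.parallelize([0], 1).map(
    lambda _: (platform.platform(), sys.version.split()[0])
).collect()[0]
print("executor:", executor)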
Upvotes: 2