Reputation: 117
When I'm running on a vanilla Spark cluster and want to run a PySpark script against a specific virtualenv, I can create the virtual environment, install packages as needed, and then zip the environment into a file, say venv.zip.
Then, at runtime, I can execute
spark-submit --archives venv.zip#VENV --master yarn script.py
and then, so long as I run
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python"
inside of script.py, the code will run against the virtual environment, and Spark will handle distributing the virtual environment to all of my worker nodes.
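For reference, a minimal sketch of what script.py looks like in that setup (the VENV alias matches the --archives fragment above; the app name and the toy job are just illustrative). The environment variable needs to be set before the SparkContext is created, since that is when Spark picks which Python to launch on the workers:

# minimal sketch of script.py, assuming the archive was submitted as --archives venv.zip#VENV
import os

# point the workers at the shipped interpreter before any SparkContext exists
os.environ["PYSPARK_PYTHON"] = "VENV/bin/python"

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venv-test").getOrCreate()

# trivial job to confirm the executors can start their Python workers from the venv
print(spark.sparkContext.parallelize(range(4)).map(lambda x: x * 2).collect())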
When I do this on Dataproc, first, the Hadoop-style # aliasing (venv.zip#VENV) doesn't work, and second, running
gcloud dataproc jobs submit pyspark script.py --archives venv.zip --cluster <CLUSTER_NAME>
with os.environ["PYSPARK_PYTHON"] = "venv.zip/bin/python"
will produce:
Error from python worker:
venv/bin/python: 1: venv.zip/bin/python: Syntax error: word unexpected (expecting ")")
It's clearly finding my Python executable and trying to run it, but there appears to be some sort of parsing error. What gives? Is there any way to tell Dataproc which Python executable to use, the way you can on a vanilla Spark cluster?
Upvotes: 1
Views: 1257
Reputation: 117
Turns out I had built the virtualenv on a different OS than the one the cluster nodes run, and was boneheaded enough not to notice I was doing so. The workers' shell was falling back to interpreting the incompatible Python binary as a script, which is what produced the "Syntax error: word unexpected" message.
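For anyone hitting the same wall: a quick way to catch this is to compare the platform you built the venv on with what the executors actually run. A rough sketch of such a check (run it with the cluster's system Python, without pointing PYSPARK_PYTHON at the broken venv; the app name is just illustrative):

import platform
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("platform-check").getOrCreate()

# what the driver sees
print("driver:  ", platform.platform(), sys.version.split()[0])

# what an executor sees; the venv should be built on a matching OS/arch (or on the cluster itself)
executor = spark.sparkContext.parallelize([0], 1).map(
    lambda _: (platform.platform(), sys.version.split()[0])
).collect()[0]
print("executor:", executor)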
Upvotes: 2