Reputation: 391
I created a cluster with Google Cloud Dataproc. I could submit jobs to the cluster just fine until I ran
pip3 install pyspark
on the cluster. After that, each time I try to submit a job, I receive an error:
Could not find valid SPARK_HOME while searching ['/tmp', '/usr/local/bin']
/usr/local/bin/spark-submit: line 27: /bin/spark-class: No such file or directory
I notice that even before pyspark was installed, SPARK_HOME was not set to anything, yet I could submit jobs just fine. Why does installing pyspark cause this problem, and how can I fix it?
Upvotes: 0
Views: 1495
Reputation: 528
brew install apache-spark already provides a working pyspark shell, so it is not necessary to additionally pip install pyspark.
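A minimal sketch of what that looks like, assuming a macOS machine with Homebrew installed:

# The Homebrew formula bundles PySpark, so the shell is available right after install
brew install apache-spark
pyspark    # launches the bundled PySpark REPL; no pip install pyspark needed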
Upvotes: 0
Reputation: 1383
Pyspark is already pre-installed on Dataproc -- you should invoke the pyspark command rather than python. For now, trying to pip install pyspark or py4j will break pyspark on Dataproc. You also need to be careful not to pip install any packages that depend on pyspark/py4j. We're aware of this issue :)
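The answer doesn't spell out a recovery step for a cluster that has already been broken this way; one hedged sketch is to remove the pip-installed copies so the pre-installed Spark is found again. Whether this fully restores the wrapper scripts isn't guaranteed, and recreating the cluster is the surest fix:

# Hypothetical cleanup: remove the pip-installed pyspark/py4j that shadow Dataproc's own Spark
sudo pip3 uninstall -y pyspark py4j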
If you're just trying to switch to Python 3, currently the easiest way to do that is to run the miniconda initialization action when creating your cluster: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/conda/. That init action conveniently also allows you to specify extra pip or conda packages to install.
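For reference, a hedged sketch of what using that init action could look like. The cluster name is a placeholder, and the script paths and metadata keys (CONDA_PACKAGES, PIP_PACKAGES) are taken on the assumption they match the linked README, so verify them there before running:

# Sketch: create a cluster with the conda init action and extra packages
gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh \
    --metadata 'CONDA_PACKAGES=numpy scipy,PIP_PACKAGES=pandas'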
We are also aware that pyspark isn't on PYTHONPATH for the python interpreter. For now, if you want to run pyspark code, use the pyspark command. Note that the pyspark command sources /etc/spark/conf/spark-env.sh, which you would have to do manually if you wanted to run import pyspark in a python shell.
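A rough sketch of doing that manually over SSH. It assumes spark-env.sh exports SPARK_HOME and that the PySpark sources live under the usual $SPARK_HOME/python layout; the pyspark wrapper normally handles all of this for you:

# Roughly what the pyspark wrapper does before a plain python shell can import pyspark
source /etc/spark/conf/spark-env.sh
export PYTHONPATH="${SPARK_HOME}/python:$(ls ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}"
python -c "import pyspark; print(pyspark.__version__)"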
Side note: rather than SSHing into the cluster and running pyspark, consider running gcloud dataproc jobs submit pyspark (docs) from your workstation, or using a Jupyter notebook.
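For example, with placeholder job file and cluster names (depending on your gcloud version you may also need a --region flag):

# Submit a PySpark job from your workstation instead of SSHing in
gcloud dataproc jobs submit pyspark my_job.py --cluster my-cluster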
Upvotes: 1