Reputation: 647
I am trying to run pyspark in Zeppelin with python3 (3.5) against Spark 2.1.0. I have the pyspark shell up and running with python3, but flipping over to Zeppelin connecting to the same local cluster gives:
Exception: Python in worker has different version 3.5 than that in driver 2.7, PySpark cannot run with different minor versions
I have modified the default spark-env.sh as follows (unmodified lines omitted for brevity):
SPARK_LOCAL_IP=127.0.0.1
SPARK_MASTER_HOST="localhost"
SPARK_MASTER_WEBUI_PORT=8080
SPARK_MASTER_PORT=7077
SPARK_DAEMON_JAVA_OPTS="-Djava.net.preferIPv4Stack=true"
export PYSPARK_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/ipython
Starting things up with ./bin/pyspark, all is good in the shell.
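As a quick sanity check, here is a minimal sketch (assuming a live pyspark shell, where sc is already defined) that compares the driver's Python against the one the workers report; the two tuples must match, or tasks will fail with the exception above:
import sys

def worker_version(_):
    import sys as worker_sys  # this import runs on the executor
    return worker_sys.version_info[:2]

print("driver:", sys.version_info[:2])  # e.g. (3, 5)
print("worker:", sc.parallelize([0], 1).map(worker_version).first())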
Zeppelin config has been modified in zeppelin-site.xml only to move the UI port from 8080 to 8666. zeppelin-env.sh has been modified as follows (showing only mods/additions):
export MASTER=spark://127.0.0.1:7077
export SPARK_APP_NAME=my_zeppelin-mf
export PYSPARK_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/ipython
export PYTHONPATH=/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
I've tried using Anaconda, but Python 3.6 is currently creating issues with Spark. I've also tried a bunch of combinations of the above config settings without success.
There is a setting referenced in the configs, zeppelin.pyspark.python, which defaults to python, but it is unclear from the docs how/where to adjust it to python3. To help eliminate OSX specifics, I was able to replicate this failure on LinuxMint 18.1 as well.
So I've been rifling through the Zeppelin docs and the Internet trying to find the proper config setting to get Zeppelin to run as a 3.5 driver. Hopefully I'm missing something obvious, but I cannot seem to track this one down. Hoping someone has done this successfully and can help identify my error.
Thank you.
Upvotes: 5
Views: 10692
Reputation: 12260
If you use the HDP sandbox, see the steps below; they might also work for Python 3 if adapted accordingly.
The following steps are for Python 2.7.14:
Download and install Python 2.7.14 in /usr/local/Python-2.7.14:
# wget http://python.org/ftp/python/2.7.14/Python-2.7.14.tar.bz2
# tar xvf Python-2.7.14.tar.bz2
# chown -R root:root Python-2.7.14
# cd Python-2.7.14
# ./configure
# make altinstall prefix=/usr/local/Python-2.7.14 exec-prefix=/usr/local/Python-2.7.14
Add following values in spark-env template under advanced spark-env section of Spark configuration in Ambari UI and restart Spark:
export PYSPARK_PYTHON=/usr/local/Python-2.7.14/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/Python-2.7.14/bin/python2.7
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/local/Python-2.7.14/bin/python2.7"
Change the following value in Spark Interpreter configuration in Zeppelin, save and restart the Interpreter:
zeppelin.pyspark.python = /usr/local/Python-2.7.14/bin/python2.7 (Default value is python)
Add and execute the following in notebook for validation:
%spark.pyspark
import sys
print(sys.version)
The following should be the output from the above notebook:
2.7.14 (default, Oct 4 2017, 09:43:59)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-18)]
Upvotes: 0
Reputation: 742
The correct way to set the %pyspark interpreter to use python 3 through the Zeppelin UI is the following (tested on the apache/zeppelin docker container):
1. Go to the Interpreter menu item
2. Find the spark interpreter
3. Set the zeppelin.pyspark.python config property to python3
4. Click Save
5. Click Restart at the top right corner to restart the interpreter
You can now check the version of python by opening a Notebook and running the following code:
%pyspark
import sys
sys.version_info
You should see something like this as the output:
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
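Since the error in the question comes from the workers rather than the driver, you can also ask an executor directly which Python it runs (a small sketch, using the sc that the %pyspark interpreter provides):
%pyspark
# Each task re-imports sys on the executor and reports its version from there.
print(sc.parallelize([0], 1).map(lambda _: __import__('sys').version_info[:2]).first())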
You must also set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to python3, like this (example using the apache/zeppelin docker container, assuming zeppelin as the container name):
docker exec -it zeppelin bash
cd /zeppelin/conf
cp zeppelin-env.sh.template zeppelin-env.sh
cat <<EOF >> zeppelin-env.sh
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
EOF
# Now, press Ctrl+d to exit the container
docker restart zeppelin
Now everything has been set up to run pyspark with python3.
Upvotes: 9
Reputation: 647
Naturally, something worked right after posting this...
In the Zeppelin config at ./conf/interpreter.json, for one of my notebooks I found the config:
"properties": {
...
"zeppelin.pyspark.python": "python",
...
}
Changing this to:
"properties": {
...
"zeppelin.pyspark.python": "python3",
...
}
Combined with the same settings as above, this has had the desired effect of getting the notebook to work with Python 3.5. However, it seems a bit clunky/hacky and I suspect there is a more elegant way to do it, so I won't call this a solution/answer, but more of a workaround.
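If you prefer to script this edit instead of doing it by hand, here is a minimal sketch. It assumes Zeppelin is stopped first and the flat "properties" layout shown above (newer Zeppelin releases nest each property in an object, so adjust accordingly), and it is worth backing the file up before running it:
# Hypothetical one-off patch for ./conf/interpreter.json
import json

PATH = "./conf/interpreter.json"

with open(PATH) as f:
    conf = json.load(f)

# Each entry under interpreterSettings carries its own properties map.
for settings in conf.get("interpreterSettings", {}).values():
    props = settings.get("properties", {})
    if props.get("zeppelin.pyspark.python") == "python":
        props["zeppelin.pyspark.python"] = "python3"

with open(PATH, "w") as f:
    json.dump(conf, f, indent=2)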
Upvotes: 6