Reputation: 3625
I'm trying to set up a stand alone instance of spark locally on a mac and use the Python 3 API. To do this I've done the following, 1. I've downloaded and installed Scala and Spark. 2. I've set up the following environment variables,
#Scala
export SCALA_HOME=$HOME/scala/scala-2.12.4
export PATH=$PATH:$SCALA_HOME/bin
#Spark
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
#Jupyter Python
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#Python
alias python="python3"
alias pip="pip3"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
Now when I run the command
pyspark --master local[2]
And type sc on the notebook, I get the following,
SparkContext
Spark UI
Version
v2.2.1
Master
local[2]
AppName
PySparkShell
Clearly my SparkContext is not initialized. I'm expecting to see an initialized SparkContext object. What am I doing wrong here?
Upvotes: 0
Views: 2496
Reputation: 60379
Well, as I have argued elsewhere, setting PYSPARK_DRIVER_PYTHON
to jupyter
(or ipython
) is a really bad and plain wrong practice, which can lead to unforeseen outcomes downstream, such as when you try to use spark-submit
with the above settings...
There is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run a jupyter kernelspec list
command, to get the list of any already available kernels in your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2
, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe
& tensorflow
), an R one (ir
), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains one single file, named kernel.json
. Let's see the contents of this file for my pyspark2
kernel:
{
"display_name": "PySpark (Spark 2.0)",
"language": "python",
"argv": [
"/opt/intel/intelpython27/bin/python2",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
"PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
"PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
"PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
}
}
Now, the easiest way for you would be to manually do the necessary changes (paths only) to my above shown kernel and save it in a new subfolder of the .../jupyter/kernels
directory (that way, it should be visible if you run again a jupyter kernelspec list
command). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn’t a great way to modify the kernelspecs. One approach uses
jupyter kernelspec list
to find thekernel.json
file and then modifies it, e.g.kernels/python3/kernel.json
, by hand.
If you don't have already a .../jupyter/kernels
folder, you can still install a new kernel using jupyter kernelspec install
- haven't tried it, but have a look at this SO answer.
If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS
setting under env
; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
Finally, don't forget to remove all the PySpark/Jupyter-related environment variables from your bash profile (leaving only SPARK_HOME
and PYSPARK_PYTHON
should be OK).
Another possibility could be to use Apache Toree, but I haven't tried it myself yet.
Upvotes: 4
Reputation: 1572
Documentation seams to say that environment variables are read from a certain file and not as shell environment variables.
Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed
Upvotes: 0