arj

Reputation: 703

Make PySpark work inside JupyterHub

I have a machine with JupyterHub (Python 2, Python 3, R and Bash kernels). I have Spark (Scala) and, of course, PySpark working. I can even use PySpark inside an interactive IPython notebook with a command like:

IPYTHON_OPTS="notebook" $path/to/bin/pyspark

(this opens a Jupyter notebook, and inside Python 2 I can use Spark)

But I can't get PySpark working inside JupyterHub.

The Spark kernel is more than I really need.

I only need PySpark inside JupyterHub. Any suggestions?

Thanks.

Upvotes: 3

Views: 5866

Answers (4)

I have created a public gist for configuring Spark 2.x with JupyterHub and a CDH 5.13 cluster.

Upvotes: 0

lmtx

Reputation: 5586

You need to configure a PySpark kernel.

On my server, the Jupyter kernels are located at:

/usr/local/share/jupyter/kernels/

You can create a new kernel by making a new directory:

mkdir /usr/local/share/jupyter/kernels/pyspark

Then create the kernel.json file inside that directory. Mine is pasted here as a reference:

{
 "display_name": "pySpark (Spark 1.6.0)",
 "language": "python",
 "argv": [
  "/usr/local/bin/python2.7",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "PYSPARK_PYTHON": "/usr/local/bin/python2.7",
  "SPARK_HOME": "/usr/lib/spark",
  "PYTHONPATH": "/usr/lib/spark/python/lib/py4j-0.9-src.zip:/usr/lib/spark/python/",
  "PYTHONSTARTUP": "/usr/lib/spark/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
 }
}

Adjust the paths and Python versions to match your installation (including the py4j version in PYTHONPATH), and your PySpark kernel is good to go.
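If you need to adapt this for several machines, the same kernel.json can be generated from a small script instead of edited by hand. The following is only a sketch under assumptions: the function names `build_pyspark_kernel_spec` and `write_kernel_spec` are made up for illustration, and the paths, py4j file name and master value passed in must be whatever your own installation uses.

```python
import json
import os


def build_pyspark_kernel_spec(spark_home, python_exe, py4j_zip,
                              master="yarn-client", display_name="pySpark"):
    """Return a dict with the same keys as the kernel.json shown above.

    Hypothetical helper: every argument is an assumption about your setup.
    """
    spark_python = os.path.join(spark_home, "python")
    return {
        "display_name": display_name,
        "language": "python",
        # {connection_file} is left as a literal; Jupyter substitutes it at launch
        "argv": [python_exe, "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "PYSPARK_PYTHON": python_exe,
            "SPARK_HOME": spark_home,
            # py4j zip first, then Spark's own Python sources
            "PYTHONPATH": os.pathsep.join(
                [os.path.join(spark_python, "lib", py4j_zip), spark_python]
            ),
            "PYTHONSTARTUP": os.path.join(spark_python, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master {} pyspark-shell".format(master),
        },
    }


def write_kernel_spec(kernels_dir, name, spec):
    """Create <kernels_dir>/<name>/kernel.json and return its path."""
    kernel_dir = os.path.join(kernels_dir, name)
    os.makedirs(kernel_dir, exist_ok=True)  # exist_ok requires Python 3
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as fh:
        json.dump(spec, fh, indent=1)
    return path
```

For the setup in this answer, the call would look like `write_kernel_spec("/usr/local/share/jupyter/kernels", "pyspark", build_pyspark_kernel_spec("/usr/lib/spark", "/usr/local/bin/python2.7", "py4j-0.9-src.zip"))`, run by a user with write access to that directory.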

Upvotes: 6

mdurant

Reputation: 28683

You could start Jupyter as usual and add the following to the top of your code:

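import sys
# Make Spark's Python sources and its bundled py4j importable
sys.path.insert(0, '<path>/spark/python/')
sys.path.insert(0, '<path>/spark/python/lib/py4j-0.8.2.1-src.zip')
import pyspark
# Replace <key> and <value> with your own settings; chain more .set() calls as needed
conf = pyspark.SparkConf().set(<key>, <value>)
sc = pyspark.SparkContext(conf=conf)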

Change the parts in angle brackets (and the py4j version, if yours differs) to match your installation.
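Because the py4j file name differs between Spark releases (this thread alone mentions py4j-0.9-src.zip and py4j-0.8.2.1-src.zip), it may be easier to look the zip up rather than hard-code it. This is a minimal sketch, not a tested recipe: `add_pyspark_to_path` is a hypothetical helper name, and it assumes the zip sits under `<spark_home>/python/lib` and matches the pattern `py4j-*-src.zip`, as in the paths above.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put Spark's Python sources and the bundled py4j zip on sys.path.

    Returns the py4j zip that was picked. Assumes the layout
    <spark_home>/python/lib/py4j-*-src.zip (an assumption, see above).
    """
    spark_python = os.path.join(spark_home, "python")
    matches = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*-src.zip")))
    if not matches:
        raise IOError("no py4j-*-src.zip found under " + spark_python)
    py4j_zip = matches[-1]  # normally only one zip is present
    for entry in (spark_python, py4j_zip):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return py4j_zip
```

After calling `add_pyspark_to_path('<path>/spark')` in the first notebook cell, `import pyspark` and the `SparkContext` creation proceed as in the snippet above.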

Upvotes: 4

Ophir Yoktan

Reputation: 8449

I haven't tried this with JupyterHub, but the approach helped me with other tools (such as Spyder).

As I understand it, the jupyterhub launcher is itself a Python script, so copy (or rename) jupyterhub to jupyterhub.py.

Then run:

spark-submit jupyterhub.py

(replace spark-submit and jupyterhub.py with the full paths to those files)

Upvotes: 0
