Reputation: 101
I'm trying to use jupyter-notebook (v4.2.2) remotely on a Spark cluster (v2.0), but when I run the following command it does not run on Spark but only runs locally:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --master spark://**spark_master_hostname**:7077
When I run pyspark alone with the same --master argument, the process shows up in "Running Applications" for the Spark cluster just fine:
pyspark --master spark://**spark_master_hostname**:7077
It's almost as if pyspark is not being run at all in the first case. Is there something wrong with the first command that prevents Jupyter from running on the Spark cluster, or is there a better way of running notebooks on a Spark cluster?
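For reference, a quick way to check from inside a notebook which master the driver is actually bound to (this assumes pyspark's startup has already created sc in the kernel) is something like:
print(sc.master)               # "local[*]" means a local run; a cluster run shows the spark://... URL
print(sc.defaultParallelism)   # typically larger once executors on the cluster are attached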
Upvotes: 8
Views: 2862
Reputation: 2567
The solution to this problem may require tunneling. I've set up the following instructions for my company.
You can make a few environment changes to have pyspark default to IPython or a Jupyter notebook.
Put the following in your ~/.bashrc
export PYSPARK_PYTHON=python3 ## for python3
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"
See: pyspark on GitHub
Next, run source ~/.bashrc
Then, when you launch pyspark --master yarn (this example uses Spark on YARN; for a standalone cluster like the one in the question, pass your spark://**spark_master_hostname**:7077 URL instead), it will open up a notebook server for you to connect to.
On a local terminal that has ssh capabilities, run
ssh -N -f -L localhost:8000:localhost:7000 <username>@<host>
If you're on Windows, I recommend MobaXterm or Cygwin.
Open up a web browser and enter the address localhost:8000 to tunnel into your notebook with Spark.
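Once the notebook is open, a quick smoke test (assuming the kernel inherited pyspark's startup so that sc already exists) is to run a tiny job and confirm it shows up under your application in the Spark UI:
print(sc.master)                          # should match the master you launched pyspark with
print(sc.parallelize(range(100)).sum())   # small distributed job; prints 4950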
Some precautions: I've never tried this with Python 3, so if you are using Python 3 as your default, it may require additional settings.
Upvotes: 0
Reputation: 61
It looks like you want to load the IPython shell, not the IPython notebook, and use PySpark through the command line?
IMO the Jupyter UI is a more convenient way to work with notebooks.
You can run a jupyter server:
jupyter notebook
then (using the jupyter UI) start a new Python 2 kernel. In the opened notebook, create a SparkContext with a configuration pointing to your Spark cluster:
from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.setMaster('spark://**spark_master_hostname**:7077')
conf.setAppName('some-app-name')
sc = SparkContext(conf=conf)
Now you have a pyspark application started on the Spark cluster, and you can interact with it via the created SparkContext, e.g.:
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

# take(10) brings the first 10 results back to the driver as a plain list
rdd = sc.parallelize(range(1000)).map(mod).take(10)
print(rdd)
The code above will be computed remotely.
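When you're done, you can stop the context so the application releases its executors on the cluster:
sc.stop()   # the application then moves out of "Running Applications" in the Spark master UI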
Upvotes: 2