user6837711

Reputation: 101

Jupyter Notebook only runs locally on Spark

I'm trying to use jupyter-notebook (v4.2.2) remotely on a Spark cluster (v2.0), but when I run the following command it does not run on the cluster, only locally:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --master spark://**spark_master_hostname**:7077

When I run pyspark alone with the same --master argument, the process shows up in "Running Applications" for the spark cluster just fine.

pyspark --master spark://**spark_master_hostname**:7077

It's almost as if pyspark is not being run at all in the former case. Is there something wrong with the first command that prevents Jupyter from running on the Spark cluster, or is there a better way of running notebooks on a Spark cluster?

Upvotes: 8

Views: 2862

Answers (2)

Jon

Reputation: 2567

The solution to this problem may require tunneling. I've set up the following instructions for my company.

You can make a few environment changes to have pyspark default to IPython or a Jupyter notebook.

Put the following in your ~/.bashrc

export PYSPARK_PYTHON=python3                                           ## Python used by the Spark executors
export PYSPARK_DRIVER_PYTHON=ipython                                    ## launch the driver through IPython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"   ## serve a notebook on port 7000

See: pyspark on GitHub

Next, run source ~/.bashrc

Then, when you launch pyspark --master yarn (Spark on YARN), it will start a notebook server for you to connect to.

On a local terminal that has ssh capabilities, run

ssh -N -f -L localhost:8000:localhost:7000 <username>@<host>

If you're on Windows, I recommend MobaXterm or Cygwin.

Open up a web browser and enter the address localhost:8000 to tunnel into your notebook with Spark.
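Once the notebook is open, a quick sanity check (a minimal sketch, assuming pyspark launched the notebook so the sc SparkContext is already defined for you) is to confirm the driver is attached to the cluster rather than running locally:

print(sc.master)                          # should print the cluster master URL, not local[*]
print(sc.parallelize(range(100)).sum())   # runs a trivial job on the cluster executors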

One precaution: I've never tried this with Python 3, so if Python 3 is your default it may require additional settings.

Upvotes: 0

Artur I

Reputation: 61

It looks like you want to load the IPython shell, not the IPython notebook, and use PySpark through the command line?

IMO the Jupyter UI is a more convenient way to work with notebooks.

You can run the Jupyter server:

jupyter notebook

then (using the Jupyter UI) start a new Python 2 kernel. In the opened notebook, create a SparkContext with a configuration pointing to your Spark cluster:

from pyspark import SparkContext, SparkConf
conf = SparkConf()
conf.setMaster('spark://**spark_master_hostname**:7077')
conf.setAppName('some-app-name')
sc = SparkContext(conf=conf)

Now you have a PySpark application started on the Spark cluster, and you can interact with it via the created SparkContext, e.g.:

def mod(x):
    import numpy as np   # imported inside the function so it is available on the executors
    return (x, np.mod(x, 2))

result = sc.parallelize(range(1000)).map(mod).take(10)   # take() brings the first 10 pairs back to the driver
print result

The map above will be computed remotely on the cluster; only the ten results returned by take() come back to the driver.
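If you want to double-check that the notebook really is attached to the cluster (the application should also appear under "Running Applications" in the master UI, as in the question) and release the executors when you are finished, a minimal sketch using the sc created above could be:

print(sc.master)     # should print spark://**spark_master_hostname**:7077
sc.stop()            # stop the application and free the cluster resources when done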

Upvotes: 2
