Reputation: 5396
I am trying to fire up a Jupyter notebook when I run the command pyspark
in the console. At the moment it only starts an interactive shell in the console, and that is not convenient for typing long lines of code. Is there a way to connect a Jupyter notebook to the pyspark shell? Thanks.
Upvotes: 6
Views: 17968
Reputation: 522
Simple Steps to Run Spark with Jupyter Notebook
1.) Install the Spark binaries independently from the Apache Foundation website and add them to your PATH
2.) Add the following entries to your .bash_profile or .bashrc
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=<port-number>'
3.) Install findspark package in your conda environment
conda install -c conda-forge findspark
4.) Open jupyter notebook
5.) Run the commands below in a cell
import findspark
findspark.init()
import pyspark
findspark.find()
6.) Create a Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
7.) Read files and do whatever operations you want
df = spark.read.csv("file-path")
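As a slightly fuller sketch of step 7 (the file name data.csv and the read options here are just illustrative assumptions), a typical CSV load looks like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

# Read a CSV that has a header row and let Spark infer the column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Quick sanity checks on the loaded DataFrame
df.printSchema()
df.show(5)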
Upvotes: 0
Reputation: 5396
cd project-folder/
pip install virtualenv
virtualenv venv
This should create a folder "venv/" inside your project folder.
Activate the virtualenv by typing
source venv/bin/activate
and then install Jupyter inside it:
pip install jupyter
Next, open ~/.bash_profile and add
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Then type source ~/.bash_profile
in the console.
You should be good to go after this.
If you now type pyspark
in the console, a Jupyter notebook will fire up.
You can also check that the pre-created Spark objects (such as sc and sqlContext)
are available in your notebook by typing, for example, sc
and executing the notebook cell.
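A minimal verification cell, assuming the notebook was launched through the pyspark command so that sc and spark already exist (the small range job is only an illustration):

# These objects are created by the pyspark launcher; no imports are needed.
print(sc.version)   # version of the attached SparkContext
print(spark)        # pre-built SparkSession (Spark 2.x and later)
# A trivial job to confirm the backend actually executes work
print(sc.parallelize(range(100)).sum())  # expected: 4950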
You can also check out Unable to load pyspark inside virtualenv
Upvotes: 4
Reputation:
Download Spark from the website; I downloaded spark-2.2.0-bin-hadoop2.7 and installed jupyter-notebook.
mak@mak-Aspire-A515-51G:~$ chmod -R 777 spark-2.2.0-bin-hadoop2.7
mak@mak-Aspire-A515-51G:~$ export SPARK_HOME='/home/mak/spark-2.2.0-bin-hadoop2.7'
mak@mak-Aspire-A515-51G:~$ export PATH=$SPARK_HOME/bin:$PATH
mak@mak-Aspire-A515-51G:~$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON="jupyter"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_PYTHON=python3
Go to the Spark python directory, open python3, and import pyspark; it should succeed.
mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ python3
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Then launch jupyter-notebook from the same directory and run the import in a cell:
mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ jupyter-notebook
import pyspark
If you want to open Jupyter from outside the Spark directory, follow the steps below:
mak@mak-Aspire-A515-51G:~$ pip3 install findspark
mak@mak-Aspire-A515-51G:~$ python
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'
>>> import findspark
>>> findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
>>> import pyspark
Similarly, launch the notebook from any directory and use findspark in the first cell:
mak@mak-Aspire-A515-51G:~$ jupyter-notebook
import findspark
findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
import pyspark
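As a quick sanity check after the findspark cell (a sketch; the app name is arbitrary), you can build a session and run a trivial query:

from pyspark.sql import SparkSession

# findspark.init() has already put pyspark on sys.path, so this works
# from any directory.
spark = SparkSession.builder.appName('findspark-check').getOrCreate()
print(spark.range(10).count())  # expected: 10
spark.stop()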
Upvotes: 0
Reputation: 546
I'm assuming you already have Spark and Jupyter notebooks installed and that they work flawlessly independently of each other.
If that is the case, then follow the steps below and you should be able to fire up a jupyter notebook with a (py)spark backend.
Go to your spark installation folder and there should be a bin
directory there:
/path/to/spark/bin
Create a file, let's call it start_pyspark.sh.
Open start_pyspark.sh and write something like:
#!/bin/bash
export PYSPARK_PYTHON=/path/to/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
pyspark "$@"
Replace the /path/to ...
with the path where you have installed your python and jupyter binaries respectively.
Most probably this step is already done, but just in case
Modify your ~/.bashrc
file by adding the following lines
# Spark
export PATH="/path/to/spark/bin:/path/to/spark/sbin:$PATH"
export SPARK_HOME="/path/to/spark"
export SPARK_CONF_DIR="/path/to/spark/conf"
Run source ~/.bashrc
and you are set.
Go ahead and try start_pyspark.sh (make it executable first, e.g. chmod +x start_pyspark.sh).
You could also pass arguments to the script, something like
start_pyspark.sh --packages dibbhatt:kafka-spark-consumer:1.0.14
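Once the notebook comes up this way, the pyspark launcher has already created the Spark entry points in the kernel; a minimal cell to confirm the (py)spark backend is attached (the sample numbers are made up):

# sc and spark are injected by the pyspark launcher; nothing to import.
print(sc.master, sc.version)
# A tiny job that forces work onto the executors
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]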
Hope it works out for you.
Upvotes: 7
Reputation: 3673
Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark, you can just
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
... and go
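For instance, a minimal word-count style check (the sample data is made up) to confirm the context actually schedules work:

# Verify the context with a small job, then shut it down cleanly.
data = sc.parallelize(["spark", "jupyter", "notebook", "spark"])
counts = data.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('spark', 2), ('jupyter', 1), ('notebook', 1)]
sc.stop()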
Upvotes: 6
Reputation: 512
Save yourself a lot of configuration headaches and just run a Docker container: https://hub.docker.com/r/jupyter/all-spark-notebook/
Upvotes: 1