Reputation: 5396
I am trying to fire up a Jupyter notebook when I run the command pyspark
in the console. At the moment it only starts an interactive shell in the console, and that is not convenient for typing long lines of code. Is there a way to connect a Jupyter notebook to the pyspark shell? Thanks.
Upvotes: 6
Views: 17968
Reputation: 522
Simple Steps to Run Spark with Jupyter Notebook
1.) Install the Spark binaries independently from the Apache Foundation website and add them to your PATH
2.) Add the following entries to your .bash_profile or .bashrc
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=<port-number>'
3.) Install findspark package in your conda environment
conda install -c conda-forge findspark
4.) Open jupyter notebook
5.) Run the commands below in a cell
import findspark
findspark.init()
import pyspark
findspark.find()
6.) Create a Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
7.) Read files and do whatever operations you want
df = spark.read.csv("file-path")
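As a slightly fuller sketch of step 7 (the file name data.csv and the read options here are just illustrative assumptions), a typical CSV load looks like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()

# Read a CSV that has a header row and let Spark infer the column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Quick sanity checks on the loaded DataFrame
df.printSchema()
df.show(5)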
Upvotes: 0
Reputation: 5396
cd project-folder/
pip install virtualenv
virtualenv venv
This should create a folder "venv/" inside your project folder.
Activate the virtualenv by typing
source venv/bin/activate
and then install Jupyter inside it:
pip install jupyter
Next, open ~/.bash_profile and add
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Then type source ~/.bash_profile
in the console.
You should be good to go after this.
If you now type pyspark
in the console, a Jupyter notebook will fire up.
You can also check that the pre-created Spark objects (such as sc and sqlContext)
are available in your notebook by typing, for example, sc
and executing the notebook cell.
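A minimal verification cell, assuming the notebook was launched through the pyspark command so that sc and spark already exist (the small range job is only an illustration):

# These objects are created by the pyspark launcher; no imports are needed.
print(sc.version)   # version of the attached SparkContext
print(spark)        # pre-built SparkSession (Spark 2.x and later)
# A trivial job to confirm the backend actually executes work
print(sc.parallelize(range(100)).sum())  # expected: 4950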
You can also check out Unable to load pyspark inside virtualenv
Upvotes: 4
Reputation:
Download Spark from the website; I downloaded spark-2.2.0-bin-hadoop2.7 and installed jupyter-notebook.
mak@mak-Aspire-A515-51G:~$ chmod -R 777 spark-2.2.0-bin-hadoop2.7
mak@mak-Aspire-A515-51G:~$ export SPARK_HOME='/home/mak/spark-2.2.0-bin-hadoop2.7'
mak@mak-Aspire-A515-51G:~$ export PATH=$SPARK_HOME/bin:$PATH
mak@mak-Aspire-A515-51G:~$ export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON="jupyter"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
mak@mak-Aspire-A515-51G:~$ export PYSPARK_PYTHON=python3
Go to the Spark python directory, open python3, and import pyspark; it should succeed.
mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ python3
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Then launch jupyter-notebook from the same directory and run the import in a cell:
mak@mak-Aspire-A515-51G:~/spark-2.2.0-bin-hadoop2.7/python$ jupyter-notebook
import pyspark
If you want to open Jupyter from outside the Spark directory, follow the steps below:
mak@mak-Aspire-A515-51G:~$ pip3 install findspark
mak@mak-Aspire-A515-51G:~$ python
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'
>>> import findspark
>>> findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
>>> import pyspark
Similarly, launch the notebook from any directory and use findspark in the first cell:
mak@mak-Aspire-A515-51G:~$ jupyter-notebook
import findspark
findspark.init('/home/mak/spark-2.2.0-bin-hadoop2.7')
import pyspark
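As a quick sanity check after the findspark cell (a sketch; the app name is arbitrary), you can build a session and run a trivial query:

from pyspark.sql import SparkSession

# findspark.init() has already put pyspark on sys.path, so this works
# from any directory.
spark = SparkSession.builder.appName('findspark-check').getOrCreate()
print(spark.range(10).count())  # expected: 10
spark.stop()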
Upvotes: 0
Reputation: 546
I'm assuming you already have Spark and Jupyter notebooks installed and that they work flawlessly independently of each other.
If that is the case, then follow the steps below and you should be able to fire up a jupyter notebook with a (py)spark backend.
Go to your spark installation folder and there should be a bin
directory there:
/path/to/spark/bin
Create a file, let's call it start_pyspark.sh.
Open start_pyspark.sh and write something like:
#!/bin/bash
export PYSPARK_PYTHON=/path/to/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
pyspark "$@"
Replace the /path/to ...
with the path where you have installed your python and jupyter binaries respectively.
Most probably this step is already done, but just in case
Modify your ~/.bashrc
file by adding the following lines
# Spark
export PATH="/path/to/spark/bin:/path/to/spark/sbin:$PATH"
export SPARK_HOME="/path/to/spark"
export SPARK_CONF_DIR="/path/to/spark/conf"
Run source ~/.bashrc
and you are set.
Go ahead and try start_pyspark.sh (make it executable first, e.g. chmod +x start_pyspark.sh).
You could also pass arguments to the script, something like
start_pyspark.sh --packages dibbhatt:kafka-spark-consumer:1.0.14
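Once the notebook comes up this way, the pyspark launcher has already created the Spark entry points in the kernel; a minimal cell to confirm the (py)spark backend is attached (the sample numbers are made up):

# sc and spark are injected by the pyspark launcher; nothing to import.
print(sc.master, sc.version)
# A tiny job that forces work onto the executors
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]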
Hope it works out for you.
Upvotes: 7
Reputation: 3673
Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark, you can just
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
... and go
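For instance, a minimal word-count style check (the sample data is made up) to confirm the context actually schedules work:

# Verify the context with a small job, then shut it down cleanly.
data = sc.parallelize(["spark", "jupyter", "notebook", "spark"])
counts = data.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())  # e.g. [('spark', 2), ('jupyter', 1), ('notebook', 1)]
sc.stop()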
Upvotes: 6
Reputation: 512
Save yourself a lot of configuration headaches and just run a Docker container: https://hub.docker.com/r/jupyter/all-spark-notebook/
Upvotes: 1