Reputation: 392
I have Apache Toree installed following the instructions at https://medium.com/@faizanahemad/machine-learning-with-jupyter-using-scala-spark-and-python-the-setup-62d05b0c7f56.
However I do not manage to import packages in the pySpark kernel by using the PYTHONPATH variable in the kernel file at:
/usr/local/share/jupyter/kernels/apache_toree_pyspark/kernel.json.
Using the notebook I can see the required .zip in sys.path and in os.environ['PYTHONPATH'], and the relevant .jar is in os.environ['SPARK_CLASSPATH'], but I get
"No module named graphframe" when importing it with import graphframe.
Any suggestions on how to get graphframe imported?
Thank you.
Upvotes: 1
Views: 439
Reputation: 3930
The quickest way to get a package like graphframes working in a Jupyter notebook is to set the PYSPARK_SUBMIT_ARGS
environment variable. This can be done from a running notebook server like this:
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = ("--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell")
Verify that it was added before launching the SparkContext with sc = pyspark.SparkContext(); os.environ should now contain:
environ{...
'PYSPARK_SUBMIT_ARGS': '--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell'}
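Putting the two steps together, a minimal notebook cell could look like the sketch below; the Spark lines are commented out since they assume a working Spark installation:

```python
import os

# PYSPARK_SUBMIT_ARGS must be set before the SparkContext (and its JVM)
# is created, otherwise the --packages flag is never seen.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell"
)

# Sanity check before launching Spark
assert "graphframes" in os.environ["PYSPARK_SUBMIT_ARGS"]

# import pyspark
# sc = pyspark.SparkContext()   # resolves and downloads the package on first launch
# from graphframes import GraphFrame
```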
After the context starts you should find a Spark tmp directory on your Python path. Check with import sys; sys.path, which should include something like this:
[...
'/tmp/spark-<###>//userFiles-<###>/graphframes_graphframes-0.7.0-spark2.4-s_2.11.jar',
'/usr/local/spark/python',
'/usr/local/spark/python/lib/py4j-0.10.7-src.zip', ...
]
This was tested with the jupyter/pyspark-notebook Docker container, for which you can also set the environment variable at build time. To do so, run docker build . with this Dockerfile:
FROM jupyter/pyspark-notebook
USER root
ENV PYSPARK_SUBMIT_ARGS --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell
USER $NB_UID
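For reference, the image could then be built and started like this (the image tag is just an example name):

```shell
# Build the image from the Dockerfile above
docker build -t pyspark-graphframes .

# Start the notebook server; Jupyter listens on port 8888 by default
docker run -p 8888:8888 pyspark-graphframes
```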
Upvotes: 0
Reputation: 392
I was using the .zip from the graphframes download page, but it did not solve the problem. The correct .zip can be created by following the steps in:
https://github.com/graphframes/graphframes/issues/172
Another solution was given at: Importing PySpark packages, although the --packages parameter didn't work for me.
Hope this helps.
Upvotes: 1