Ashish Bindal

Reputation: 995

Dataproc: Jupyter pyspark notebook unable to import graphframes package

In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter pyspark notebook.

Pyspark kernel config:

PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'

Following is the command to initialize the cluster:

gcloud dataproc clusters create my-dataproc-cluster \
    --properties spark:spark.jars.packages=graphframes:graphframes:0.2.0-spark2.0-s_2.11,spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
    --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --num-workers 2 \
    --worker-machine-type=n1-standard-4 \
    --master-machine-type=n1-standard-4
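
In the notebook, the failure shows up at import time, presumably something like this (a hypothetical reproduction; the post does not include the traceback):

# In a Jupyter pyspark notebook cell on the cluster:
from graphframes import GraphFrame  # fails with ImportError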

Upvotes: 3

Views: 1698

Answers (4)

Alex Ortner

Reputation: 1228

The simplest way to start Jupyter with pyspark and graphframes is to launch Jupyter from pyspark with the additional package attached.

Just open your terminal, set the two environment variables, and start pyspark with the graphframes package:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11

The advantage of this approach is that if you later want to run your code via spark-submit, you can use the same start command.
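
To verify the package actually loaded, a minimal check in the resulting notebook might look like the following (a sketch; spark is the session pyspark pre-creates, and the tiny DataFrames are purely illustrative):

from graphframes import GraphFrame

# GraphFrame expects an "id" column on vertices and "src"/"dst" columns on edges.
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()  # succeeds only if both the jar and the Python bindings resolved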

Upvotes: 0

Parag Chaudhari

Reputation: 348

If you can use EMR notebooks, you can install additional Python libraries/dependencies using the install_pypi_package() API from within the notebook. These dependencies (including transitive dependencies, if any) will be installed on all executor nodes.

More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html
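
Per the linked docs, the call is made on the notebook's preconfigured SparkContext; a minimal sketch (the package name here is illustrative, and note that graphframes would still need its JVM jar supplied separately, e.g. via --packages):

# Inside an EMR notebook cell; sc is the preconfigured SparkContext.
sc.install_pypi_package("graphframes")  # installed on all executor nodes
sc.list_packages()                      # confirm what is now available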

Upvotes: 0

Pankaj Kumar

Reputation: 3389

I found another way to add packages that works in a Jupyter notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
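
Note that spark.jars.packages is only resolved when the SparkContext's JVM is first launched, so this builder config is silently ignored if the kernel has already created a context. A quick sanity check (a sketch; _active_spark_context is a private but commonly used pyspark attribute):

from pyspark import SparkContext

# None means no context exists yet, so the .config() above will take effect.
print(SparkContext._active_spark_context)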

Upvotes: 2

Patrick Clay

Reputation: 1349

This is an old bug with Spark shells and YARN that I thought was fixed in SPARK-15782, but apparently this case was missed.

The suggested workaround is adding

import os

# The --packages jar lands in the driver's ivy cache but is never put on the
# Python path; addPyFile ships it to the executors and makes it importable.
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))

before your import.

Upvotes: 4
