Cedric H.

Reputation: 8298

Import PySpark packages with a regular Jupyter Notebook

What is the pyspark script actually doing besides importing packages properly? Is it possible to use a regular Jupyter notebook and then import what is needed?

Upvotes: 2

Views: 8777

Answers (4)

Meursault

Reputation: 77

You could do the following:

from pyspark.sql import SparkSession

# Pull in the PostgreSQL JDBC driver when the session is created
spark = SparkSession.builder.appName("appname")\
        .config('spark.jars.packages', 'org.postgresql:postgresql:42.5.4')\
        .getOrCreate()

as seen in https://blog.devgenius.io/spark-installing-external-packages-2e752923392e
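For example, once the session is up, the downloaded driver can be used for a JDBC read. This is only a sketch; the connection URL, table name, and credentials below are placeholders:

# Sketch of using the driver pulled in above; URL, table, and credentials are placeholders
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/mydb") \
    .option("dbtable", "public.some_table") \
    .option("user", "postgres") \
    .option("password", "secret") \
    .load()

df.show()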

Upvotes: 0

Kamil Sindi

Reputation: 22832

Assuming you haven't already created the context, what I like to do is set the submit args using PYSPARK_SUBMIT_ARGS:

import os

# Must be set before pyspark is imported and the context is created
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 15g --packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
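In a plain notebook this goes before pyspark is imported; creating the context afterwards picks up those options. A minimal sketch, assuming the Spark Python packages are already importable and using an arbitrary app name:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# The JVM is launched with the --driver-memory and --packages options set above
sc = SparkContext(appName="notebook")
sqlContext = SQLContext(sc)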

Upvotes: 2

Jacek Wasilewski

Reputation: 21

You could consider creating a PySpark kernel for Jupyter - it would import the pyspark packages for you.

Create the file (you need to create the directory first; for older Jupyter versions it might be located somewhere else):

~/.local/share/jupyter/kernels/pyspark/kernel.json

with the following content:

{
 "display_name": "pySpark (Spark 1.6.0)",
 "language": "python",
 "argv": [
  "/usr/bin/python2",
  "-m",
  "IPython.kernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/usr/local/lib/spark-1.6.0-bin-hadoop2.6",
  "PYTHONPATH": "/usr/local/lib/spark-1.6.0-bin-hadoop2.6/python/:/usr/local/lib/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip",
  "PYTHONSTARTUP": "/usr/local/lib/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
 }
}

Adjust the Spark paths to match your installation.

Upvotes: 2

zero323

Reputation: 330123

Yes, it is possible, but it can be painful. The Python side alone is not an issue: all you need is to set $SPARK_HOME and add $SPARK_HOME/python (and, if it is not accessible otherwise, $SPARK_HOME/python/lib/py4j-[VERSION]-src.zip) to the Python path. The pyspark script, however, also handles the JVM setup (--packages, --jars, --conf, etc.).

This can be handled using the PYSPARK_SUBMIT_ARGS variable or using $SPARK_HOME/conf (see for example How to load jar dependencies in IPython Notebook).
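A minimal sketch of that manual wiring in a plain notebook, assuming a Spark 1.6.0 layout like the one in the kernel answer above (the paths, package, and py4j version are placeholders and must match your installation):

import os
import sys

# Placeholder paths; point these at your actual Spark installation
os.environ['SPARK_HOME'] = '/usr/local/lib/spark-1.6.0-bin-hadoop2.6'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'

# Make pyspark and py4j importable
sys.path.insert(0, os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.insert(0, os.path.join(os.environ['SPARK_HOME'], 'python', 'lib', 'py4j-0.9-src.zip'))

from pyspark import SparkContext
sc = SparkContext(appName="notebook")   # JVM is set up with the submit args above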

There is an old blog post from Cloudera which describes an example configuration and, as far as I remember, still works.

Upvotes: 3
