Reputation: 8298
What is the pyspark command actually doing, other than importing packages properly? Is it possible to use a regular Jupyter notebook and then import what is needed?
Upvotes: 2
Views: 8777
Reputation: 77
You could do the following:

from pyspark.sql import SparkSession

# spark.jars.packages pulls the Postgres JDBC driver from Maven when the session starts.
spark = SparkSession.builder.appName("appname") \
    .config('spark.jars.packages', 'org.postgresql:postgresql:42.5.4') \
    .getOrCreate()
as seen in https://blog.devgenius.io/spark-installing-external-packages-2e752923392e
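For example, with the Postgres driver on the classpath you can then read a table over JDBC. The connection details below are placeholders, not from the original answer:

# Hypothetical connection details; substitute your own database, table, and credentials.
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/mydb") \
    .option("dbtable", "public.mytable") \
    .option("user", "myuser") \
    .option("password", "secret") \
    .load()
df.show()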
Upvotes: 0
Reputation: 22832
Assuming you haven't already created the context, what I like to do is set the submit args using PYSPARK_SUBMIT_ARGS:

import os

# Extra driver memory plus the spark-csv package; the value must end with pyspark-shell.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 15g --packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
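The variable is only read when the JVM is launched, so it must be set before the context is created. Continuing from the snippet above:

# Only after PYSPARK_SUBMIT_ARGS is set should the context be created,
# since that is when the JVM is launched with those arguments.
from pyspark import SparkContext
sc = SparkContext(appName="notebook")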
Upvotes: 2
Reputation: 21
You could consider creating a pySpark kernel for Jupyter - it would import the pyspark packages for you.
Create the file (you may need to create the directory first; in older versions it might be located somewhere else):
~/.local/share/jupyter/kernels/pyspark/kernel.json
with the following content:
{
  "display_name": "pySpark (Spark 1.6.0)",
  "language": "python",
  "argv": [
    "/usr/bin/python2",
    "-m",
    "IPython.kernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/usr/local/lib/spark-1.6.0-bin-hadoop2.6",
    "PYTHONPATH": "/usr/local/lib/spark-1.6.0-bin-hadoop2.6/python/:/usr/local/lib/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/usr/local/lib/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
  }
}
Change the Spark paths appropriately.
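If you would rather not maintain a kernel spec, the findspark package does the equivalent path setup at runtime inside a plain notebook. A minimal sketch, assuming findspark is installed and SPARK_HOME points at your Spark installation:

import findspark
findspark.init()  # locates Spark via SPARK_HOME and adds pyspark/py4j to sys.path

from pyspark import SparkContext
sc = SparkContext(appName="notebook")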
Upvotes: 2
Reputation: 330123
Yes, it is possible, but it can be painful. Python alone is not an issue: all you need is to set $SPARK_HOME and add $SPARK_HOME/python (and, if it is not accessible otherwise, $SPARK_HOME/python/lib/py4j-[VERSION]-src.zip) to your Python path. But the pyspark script also handles the JVM setup (--packages, --jars, --conf, etc.).
This can be handled using the PYSPARK_SUBMIT_ARGS variable or using $SPARK_HOME/conf (see for example How to load jar dependencies in IPython Notebook).
There is an old blog post from Cloudera which describes an example configuration and, as far as I remember, still works.
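Putting that together, a minimal sketch of doing it by hand in a plain notebook (the install path is illustrative; adjust it to your setup):

import glob
import os
import sys

# Illustrative Spark location; adjust to your installation.
spark_home = "/usr/local/lib/spark-1.6.0-bin-hadoop2.6"
os.environ["SPARK_HOME"] = spark_home

# Make pyspark and the bundled py4j importable.
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python/lib/py4j-*-src.zip"))[0])

# JVM options (--packages, --jars, --conf, ...) go here; must end with pyspark-shell.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell"

from pyspark import SparkContext
sc = SparkContext(appName="notebook")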
Upvotes: 3