Reputation: 3790
I am running a Jupyter notebook on a Spark cluster (with YARN). I am using the "findspark" package to set up the notebook, and it works perfectly well (I connect to the cluster master through an SSH tunnel). When I write a self-contained notebook, everything works; for example, the following code runs with no problem:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
a.take(10)
sc.stop()
The Spark job is distributed across the workers as expected. However, when I want to use a Python package that I wrote, the files are missing on the workers.
When I am not using the Jupyter notebook and instead use spark-submit --master yarn --py-files myPackageSrcFiles.zip, my Spark job works fine; for example, the following code runs correctly:
main.py
import pyspark
from myPackage import myFunc
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
b = a.map(lambda x: myFunc(x))
b.take(10)
sc.stop()
Then
spark-submit --master yarn --py-files myPackageSrcFiles.zip main.py
The question is: how do I run main.py from a Jupyter notebook? I tried specifying the .zip package in the SparkContext with the pyfiles keyword, but I got an error...
Upvotes: 6
Views: 2832
Reputation: 35249
I tried specifying the .zip package in the SparkContext with the pyfiles keyword but I got an error
It is camel case:
sc = pyspark.SparkContext(appName='myApp', pyFiles=["myPackageSrcFiles.zip"])
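Putting it together, a minimal notebook cell might look like the sketch below (this assumes myPackageSrcFiles.zip sits in the notebook's working directory on the driver; adjust the path otherwise):
import findspark
findspark.init()

import pyspark

# note the camel case: pyFiles, not pyfiles
sc = pyspark.SparkContext(appName='myApp', pyFiles=['myPackageSrcFiles.zip'])

# the zip is also placed on the driver's sys.path, so the import works here as well
from myPackage import myFunc

a = sc.range(1000, numSlices=10)
b = a.map(myFunc)
print(b.take(10))

sc.stop()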
Or you can use addPyFile:
sc.addPyFile("myPackageSrcFiles.zip")
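addPyFile works on an already-created SparkContext, as long as it is called before the first action that needs the package, e.g. (same hypothetical path as above):
sc = pyspark.SparkContext(appName='myApp')
sc.addPyFile('myPackageSrcFiles.zip')  # must run before any task that imports myPackage

from myPackage import myFunc
print(sc.range(1000, numSlices=10).map(myFunc).take(10))
sc.stop()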
Upvotes: 6