Reputation: 3790
I am running a Jupyter notebook on a Spark cluster (with YARN). I am using the "findspark" package to set up the notebook, and it works perfectly well (I connect to the cluster master through an SSH tunnel). When I write a self-contained notebook, everything works; for example, the following code runs with no problem:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
a.take(10)
sc.stop()
The Spark job is distributed across the workers as expected. However, when I want to use a Python package that I wrote, the files are missing on the workers.
When I am not using the Jupyter notebook and instead use spark-submit --master yarn --py-files myPackageSrcFiles.zip, my Spark job works fine; for example, the following code runs correctly:
main.py
import pyspark
from myPackage import myFunc
sc = pyspark.SparkContext(appName='myApp')
a = sc.range(1000,numSlices=10)
b = a.map(lambda x: myFunc(x))
b.take(10)
sc.stop()
Then
spark-submit --master yarn --py-files myPackageSrcFiles.zip main.py
The question is: how do I run main.py from a Jupyter notebook? I tried specifying the .zip package in the SparkContext with the pyfiles keyword, but I got an error...
Upvotes: 6
Views: 2832
Reputation: 35249
I tried specifying the .zip package in the SparkContext with the pyfiles keyword but I got an error
It is camel case:
sc = pyspark.SparkContext(appName='myApp', pyFiles=["myPackageSrcFiles.zip"])
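Putting it together, a minimal notebook cell might look like the sketch below (this assumes myPackageSrcFiles.zip sits in the notebook's working directory on the driver; adjust the path otherwise):
import findspark
findspark.init()

import pyspark

# note the camel case: pyFiles, not pyfiles
sc = pyspark.SparkContext(appName='myApp', pyFiles=['myPackageSrcFiles.zip'])

# the zip is also placed on the driver's sys.path, so the import works here as well
from myPackage import myFunc

a = sc.range(1000, numSlices=10)
b = a.map(myFunc)
print(b.take(10))

sc.stop()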
Or you can use addPyFile:
sc.addPyFile("myPackageSrcFiles.zip")
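addPyFile works on an already-created SparkContext, as long as it is called before the first action that needs the package, e.g. (same hypothetical path as above):
sc = pyspark.SparkContext(appName='myApp')
sc.addPyFile('myPackageSrcFiles.zip')  # must run before any task that imports myPackage

from myPackage import myFunc
print(sc.range(1000, numSlices=10).map(myFunc).take(10))
sc.stop()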
Upvotes: 6