Reputation: 12304
Just poking around spark-submit, I was under the impression that if my application has dependencies on other .py files then I have to distribute them using the --py-files option (see "Bundling Your Application's Dependencies"). I took that to mean any imported file had to be declared via --py-files, yet the following works fine... two .py files:
spark_submit_test_lib.py:
def do_sum(sc):
    # build a small RDD and sum it across the executors
    data = [1, 2, 3, 4, 5]
    distData = sc.parallelize(data)
    return distData.sum()
and spark_submit_test.py:
from pyspark import SparkContext, SparkConf
from spark_submit_test_lib import do_sum
conf = SparkConf().setAppName('JT_test')
sc = SparkContext(conf=conf)
print(do_sum(sc))
submitted using:
spark-submit --queue 'myqueue' spark_submit_test.py
All worked fine: the code ran, yielded the correct result, and spark-submit terminated gracefully.
However, having read the documentation, I would have thought I had to do this:
spark-submit --queue 'myqueue' --py-files spark_submit_test_lib.py spark_submit_test.py
That still worked, of course. I'm just wondering why the former worked as well. Any suggestions?
Upvotes: 0
Views: 1252
Reputation: 2099
You must be submitting this in a local environment, where your driver and executors run on the same machine; that is the reason it worked. But if you deploy to a cluster and try to run it there, you have to use the --py-files option.
Please check the link for more details.
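As a minimal sketch (assuming a YARN cluster and the same two files from the question), a cluster-mode submission would ship the dependency explicitly so the remote driver and executors can import it:

spark-submit --master yarn --deploy-mode cluster \
  --queue 'myqueue' \
  --py-files spark_submit_test_lib.py \
  spark_submit_test.py

In local or client mode the driver runs on the machine where spark-submit is invoked, so the plain import resolves from the current working directory; a cluster-mode driver runs on a worker node and will not find the file unless it is shipped with --py-files or already present on that node.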
Upvotes: 1