Reputation: 12304
Just poking around spark-submit, I was under the impression that if my application has dependencies on other .py files then I have to distribute them using the --py-files option (see "Bundling Your Application's Dependencies"). I took that to mean any imported file had to be declared via --py-files, yet the following works fine... two .py files:
spark_submit_test_lib.py:
def do_sum(sc):
    # build a small RDD and sum it across the executors
    data = [1, 2, 3, 4, 5]
    distData = sc.parallelize(data)
    return distData.sum()
and spark_submit_test.py:
from pyspark import SparkContext, SparkConf
from spark_submit_test_lib import do_sum
conf = SparkConf().setAppName('JT_test')
sc = SparkContext(conf=conf)
print(do_sum(sc))
submitted using:
spark-submit --queue 'myqueue' spark_submit_test.py
All worked fine: the code ran, yielded the correct result, and spark-submit terminated gracefully.
However, having read the documentation, I would have thought I had to do this:
spark-submit --queue 'myqueue' --py-files spark_submit_test_lib.py spark_submit_test.py
That still worked, of course. I'm just wondering why the former worked as well. Any suggestions?
Upvotes: 0
Views: 1252
Reputation: 2099
You must be submitting this in a local environment, where your driver and executors run on the same machine; that is the reason it worked. But if you deploy to a cluster and try to run it there, you have to use the --py-files option.
Please check the link for more details.
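As a minimal sketch (assuming a YARN cluster and the same two files from the question), a cluster-mode submission would ship the dependency explicitly so the remote driver and executors can import it:

spark-submit --master yarn --deploy-mode cluster \
  --queue 'myqueue' \
  --py-files spark_submit_test_lib.py \
  spark_submit_test.py

In local or client mode the driver runs on the machine where spark-submit is invoked, so the plain import resolves from the current working directory; a cluster-mode driver runs on a worker node and will not find the file unless it is shipped with --py-files or already present on that node.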
Upvotes: 1