Reputation: 1867
I'm running into a problem when I run spark-submit and try to import my own Python files.
spark-submit \
--master yarn \
--verbose \
--deploy-mode cluster \
--executor-memory 8g \
--driver-memory 10g \
--num-executors 100 \
--executor-cores 10 \
--py-files dgs://user/tmp/dependency.zip \
test.py
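As far as I understand, --py-files ships dependency.zip to the driver and executors and adds it to the Python search path, so whether the import works depends on the layout inside the zip. To check from inside the job I can print the search path (just a debugging sketch):
import sys
# Show which search-path entries mention the shipped zip.
print([p for p in sys.path if "dependency" in p])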
I have two Python files, data.py and proccess.py, in a dependency folder. Then I run zip -r dependency.zip dependency/ and get dependency.zip.
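Note that zipping the folder this way keeps the dependency/ prefix on every archive entry; a quick way to confirm (illustrative, using the file names above):
import zipfile
# With `zip -r dependency.zip dependency/` the entries come out as e.g.
# ['dependency/', 'dependency/data.py', 'dependency/proccess.py'],
# so data.py is not at the top level of the archive.
with zipfile.ZipFile("dependency.zip") as zf:
    print(zf.namelist())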
Here is my test.py:
from pyspark import SparkContext
from data import get_data

sc = SparkContext.getOrCreate()

if __name__ == "__main__":
    data = get_data()
    distData = sc.parallelize(data)
    print("done", distData.collect())
In data.py:
def get_data():
    return [1, 2, 3, 4, 5]
But I get an error: No module named data.
Upvotes: 0
Views: 343
Reputation: 780
Make dependency a package (by putting an empty __init__.py file in it) and import data as from dependency import data. It should work.
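For example, a minimal sketch of the fixed layout and import, assuming the zip is built the same way as in the question:
# Layout inside dependency.zip:
#   dependency/__init__.py   <- empty file that marks dependency as a package
#   dependency/data.py
#   dependency/proccess.py

# test.py
from pyspark import SparkContext
from dependency.data import get_data  # or: from dependency import data

if __name__ == "__main__":
    sc = SparkContext.getOrCreate()
    distData = sc.parallelize(get_data())
    print("done", distData.collect())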
Upvotes: 2