Reputation: 1
I am trying to submit PySpark code with a pandas UDF (to use fbprophet...). It works fine when submitted locally, but a cluster submit fails with an error such as:
Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 41, ip-172-31-11-94.ap-northeast-2.compute.internal, executor 2): java.io.IOException: Cannot run program
"/mnt/yarn/usercache/hadoop/appcache/application_1620263926111_0229/container_1620263926111_0229_01_000001/environment/bin/python": error=2, No such file or directory
My spark-submit command:
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives ./environment.tar.gz#environment \
--files config.ini \
$1
I built environment.tar.gz with conda-pack; dependencies.zip holds my local packages, and config.ini contains settings the job loads.
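For reference, environment.tar.gz was built along these lines (the environment name prophet_env, channel, and package list here are illustrative):
# Illustrative recipe; environment name, channel, and packages are assumptions
conda create -y -n prophet_env -c conda-forge python=3.7 fbprophet pyarrow conda-pack
conda activate prophet_env
conda pack -n prophet_env -o environment.tar.gz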
Is there any way to handle this problem?
Upvotes: 0
Views: 2763
Reputation: 159
You can't use a local path here:
--archives ./environment.tar.gz#environment
Publish environment.tar.gz on HDFS:
venv-pack -o environment.tar.gz
# or conda pack
hdfs dfs -put -f environment.tar.gz /spark/app_name/
hdfs dfs -chmod 0664 /spark/app_name/environment.tar.gz
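You can verify the upload before resubmitting:
hdfs dfs -ls /spark/app_name/environment.tar.gz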
Then change the --archives argument of spark-submit:
--archives hdfs:///spark/app_name/environment.tar.gz#environment
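Putting it together, the submit command from the question stays the same except for the --archives line (a sketch; all other flags are copied from the question):
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives hdfs:///spark/app_name/environment.tar.gz#environment \
--files config.ini \
$1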
More info: PySpark on YARN in self-contained environments
Upvotes: 1