lubby

Reputation: 1

PySpark YARN cluster submit error (Cannot run program python)

I am trying to submit PySpark code with a pandas UDF (in order to use fbprophet...). It runs fine when submitted in local mode, but a cluster submit fails with an error like:

Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 41, ip-172-31-11-94.ap-northeast-2.compute.internal, executor 2): java.io.IOException: Cannot run program "/mnt/yarn/usercache/hadoop/appcache/application_1620263926111_0229/container_1620263926111_0229_01_000001/environment/bin/python": error=2, No such file or directory

My spark-submit command:

PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python     \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python     \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip   \
--archives ./environment.tar.gz#environment \
--files config.ini \
$1

I built environment.tar.gz with conda-pack; dependencies.zip holds my local packages, and config.ini is shipped to load settings.
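For reference, a typical conda-pack workflow looks roughly like this (a sketch; the environment name my_env and the package list are placeholders, not taken from the question):

# create a conda env holding the job's dependencies, then pack it into a relocatable archive
conda create -y -n my_env -c conda-forge python=3.7 pandas pyarrow fbprophet
conda install -y -n my_env -c conda-forge conda-pack
conda pack -n my_env -o environment.tar.gz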

Is there any way to handle this problem?

Upvotes: 0

Views: 2763

Answers (1)

Amir Bashiri

Reputation: 159

You can't use a local path here; in cluster mode the YARN containers need to be able to fetch the archive:

  --archives ./environment.tar.gz#environment

Publish environment.tar.gz to HDFS instead:

venv-pack -o environment.tar.gz
# or, for a conda environment: conda pack -o environment.tar.gz

hdfs dfs -put -f environment.tar.gz /spark/app_name/
hdfs dfs -chmod 0664 /spark/app_name/environment.tar.gz
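You can check that the archive landed where expected:

hdfs dfs -ls /spark/app_name/environment.tar.gz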

Then change the spark-submit argument to point at HDFS:

  --archives hdfs:///spark/app_name/environment.tar.gz#environment
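
With that change, the submit command from the question stays the same apart from the --archives line (a sketch, reusing the question's paths verbatim):

PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python \
--jars jars/org.elasticsearch_elasticsearch-spark-20_2.11-7.10.2.jar \
--py-files dependencies.zip \
--archives hdfs:///spark/app_name/environment.tar.gz#environment \
--files config.ini \
$1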

More info: PySpark on YARN in self-contained environments

Upvotes: 1
