Reputation: 31
I am trying to run a PySpark job on YARN with the spark.shuffle.service.enabled=true option, but the job never completes.
Without the option, the job works well:
user@e7524bf7f996:~$ pyspark --master yarn
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0004).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
45
With the option --conf spark.shuffle.service.enabled=true, the job hangs:
user@e7524bf7f996:~$ pyspark --master yarn --conf spark.shuffle.service.enabled=true
Using Python version 3.9.7 (default, Sep 16 2021 13:09:58)
Spark context Web UI available at http://e7524bf7f996:4040
Spark context available as 'sc' (master = yarn, app id = application_1644937120225_0005).
SparkSession available as 'spark'.
>>> sc.parallelize(range(10)).sum()
2022-02-15 15:10:14,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:29,590 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2022-02-15 15:10:44,591 WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Are there other options in Spark or YARN that need to be enabled to make spark.shuffle.service.enabled work?
I am running Spark 3.1.2, Python 3.9.7, and Hadoop 3.2.1.
Thank you,
Bertrand
Upvotes: 0
Views: 1298
Reputation: 31
Thanks Warren for your help.
Here is the setup that worked for me:
https://github.com/BertrandBrelier/SparkYarn/blob/main/yarn-site.xml
echo "export YARN_HEAPSIZE=2000" >> /home/user/hadoop-3.2.1/etc/hadoop/yarn-env.sh
ln -s /home/user/spark-3.1.2-bin-hadoop3.2/yarn/spark-3.1.2-yarn-shuffle.jar /home/user/hadoop-3.2.1/share/hadoop/yarn/lib/.
echo "spark.shuffle.service.enabled true" >> /home/user/spark-3.1.2-bin-hadoop3.2/conf/spark-defaults.conf
After restarting Hadoop and Spark, I was able to start a pyspark session:
pyspark --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true
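To confirm, the job from the question should now complete instead of hanging:

>>> sc.parallelize(range(10)).sum()
45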
Upvotes: 0
Reputation: 1495
You need to configure the external shuffle service on the YARN cluster as follows:

1. Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala-<version> if you are building Spark yourself, and under yarn if you are using a distribution.
2. In the yarn-site.xml on each node, add spark_shuffle to yarn.nodemanager.aux-services, then set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService.
3. Increase the NodeManager heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.

For details, please refer to https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service
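For example, the spark_shuffle entries in yarn-site.xml would look roughly like this (a minimal sketch using the property names from the docs; if you already run other aux-services such as mapreduce_shuffle, keep them in the comma-separated value):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>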
If it is still not working, try running with --deploy-mode cluster to ensure the driver can communicate with the YARN cluster for scheduling.
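For example, a minimal sketch (your_job.py is a hypothetical script name; cluster deploy mode applies to spark-submit jobs, since the interactive pyspark shell only runs in client mode):

spark-submit --master yarn --deploy-mode cluster \
    --conf spark.shuffle.service.enabled=true \
    your_job.py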
Upvotes: 2