nish1013

Reputation: 3728

Spark Python submission error : File does not exist: pyspark.zip

I'm trying to submit a Python Spark application in yarn-cluster mode.

Seq(
  System.getenv("SPARK_HOME") + "/bin/spark-submit",
  "--master", sparkConfig.getString("spark.master"),
  "--executor-memory", sparkConfig.getString("spark.executor-memory"),
  "--num-executors", sparkConfig.getString("spark.num-executors"),
  "python/app.py"
) !

I'm getting the following error:

Diagnostics: File does not exist: hdfs://xxxxxx:8020/user/hdfs/.sparkStaging/application_123456789_0138/pyspark.zip java.io.FileNotFoundException: File does not exist: hdfs://xxxxxx:8020/user/hdfs/.sparkStaging/application_123456789_0138/pyspark.zip

I found https://issues.apache.org/jira/browse/SPARK-10795

But the ticket is still open!

Upvotes: 2

Views: 7803

Answers (6)

Abolfazl karimian

Reputation: 19

The HADOOP_CONF_DIR environment variable must be set so Spark can find this file.

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Set it in $SPARK_HOME/conf/spark-env.sh.
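A minimal sketch of what that could look like (the etc/hadoop path assumes a standard Hadoop layout under HADOOP_HOME; adjust it to wherever your *-site.xml files actually live):

# $SPARK_HOME/conf/spark-env.sh
# Point Spark at the directory holding core-site.xml, hdfs-site.xml and yarn-site.xml
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop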

Upvotes: 0

B. Robinson

Reputation: 116

I answered this here: https://stackoverflow.com/a/55457870/3357812. For me, the key was that spark.hadoop.fs.defaultFS must be set in the SparkConf inside Python.

from pyspark import SparkConf

# _app_name, _fs_host and _rm_host are placeholders for your app name,
# HDFS namenode host and YARN ResourceManager host.
yarn_conf = SparkConf().setAppName(_app_name) \
    .setMaster("yarn") \
    .set("spark.executor.memory", "4g") \
    .set("spark.hadoop.fs.defaultFS", "hdfs://{}:8020".format(_fs_host)) \
    .set("spark.hadoop.yarn.resourcemanager.hostname", _rm_host) \
    .set("spark.hadoop.yarn.resourcemanager.address", "{}:8050".format(_rm_host))

Upvotes: 0

user3105943

Reputation: 13

Try adding the HDFS name node property to yarn-site.xml:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://your-name-node-host:8989</value>
</property>

Ensure that the YARN_CONF_DIR environment variable points to the directory containing yarn-site.xml.
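For example (the /etc/hadoop/conf path is an assumption; use whatever directory holds your yarn-site.xml):

# Make the YARN/HDFS client configuration visible to spark-submit
export YARN_CONF_DIR=/etc/hadoop/conf

You can then sanity-check which default filesystem the client resolves with:

hdfs getconf -confKey fs.defaultFS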

Upvotes: 0

Neeraj Jain

Reputation: 1343

This happens when you spark-submit a job with deploy-mode "cluster" while the code sets the master to "local"; e.g.

val sparkConf = new SparkConf().setAppName("spark-pi-app").setMaster("local[10]");

You have two options: Option #1: Change the above line to:

val sparkConf = new SparkConf().setAppName("spark-pi-app");

and submit your job as

./bin/spark-submit --master yarn --deploy-mode cluster --driver-memory 512m --executor-memory 512m --executor-cores 1 --num-executors 3 --jars hadoop-common-{version}.jar,hadoop-lzo-{version}.jar --verbose --queue hadoop-queue --class "SparkPi" sparksbtproject_2.11-1.0.jar

Option #2: Submit your job with deploy-mode as "client"

./bin/spark-submit --master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m --executor-cores 1 --num-executors 3 --jars hadoop-common-{version}.jar,hadoop-lzo-{version}.jar --verbose --queue hadoop-queue --class "SparkPi" sparksbtproject_2.11-1.0.jar

Upvotes: 3

Sumit Purohit

Reputation: 168

In my experience with Scala jobs, I have seen that yarn-cluster mode gives this error when the code is trying to setMaster("local") somewhere. Please try to remove any reference to setting a local master.

Again, my answer is based on the Scala behavior, but I hope this helps.
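For the Python side of this question, a minimal sketch of the same fix (leave the master out of the code so that spark-submit's --master flag takes effect):

from pyspark import SparkConf, SparkContext

# No .setMaster(...) here: the master comes from spark-submit (--master yarn)
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)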

Upvotes: 2

Carlos Bribiescas

Reputation: 4427

Are you failing to create a proper Spark context? I suspect that is the issue. I have also updated https://issues.apache.org/jira/browse/SPARK-10795

Upvotes: 0
