Ajeet

Reputation: 725

Apache Spark in yarn-cluster mode is throwing Hadoop FileAlreadyExistsException

I am trying to execute my Spark job in yarn-cluster mode. It works fine in standalone and yarn-client mode, but in yarn-cluster mode it throws a FileAlreadyExistsException at pairs.saveAsTextFile(output);

Here is my implementation of the job:

SparkConf sparkConf = new SparkConf().setAppName("LIM Spark PolygonFilter").setMaster(master);
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
Broadcast<IGeometry> boundryBroadCaster = broadcastBoundry(javaSparkContext, boundaryPath);
JavaRDD<String> file = javaSparkContext.textFile(input); //.cache();
JavaRDD<String> pairs = file.filter(new FilterFunction(params, boundryBroadCaster));
pairs.saveAsTextFile(output); // throws FileAlreadyExistsException in yarn-cluster mode

According to the logs, it works on one node and after that it starts throwing this exception on all of the remaining nodes.

Can someone please help me fix it? Thanks.

Upvotes: 4

Views: 416

Answers (1)

Ajeet

Reputation: 725

It works after disabling output-spec validation (spark.hadoop.validateOutputSpecs=false).
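For reference, a minimal sketch of applying that setting programmatically on the question's SparkConf (the same property can also be passed on the command line via spark-submit --conf spark.hadoop.validateOutputSpecs=false):

SparkConf sparkConf = new SparkConf()
        .setAppName("LIM Spark PolygonFilter")
        .setMaster(master)
        // Skip Hadoop's check that the output directory does not already exist.
        .set("spark.hadoop.validateOutputSpecs", "false");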

This looks like a Hadoop feature that warns the user that the specified output directory already contains data, which could be lost if the same directory is reused for the next iteration of the job.

In my application I added an extra parameter for the job, --overwrite, and we use it like this:

spark.hadoop.validateOutputSpecs = !overwrite

If the user wants to overwrite the existing output, they can pass the "overwrite" flag as true.
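A sketch of that wiring, assuming the job reads a boolean from a hypothetical --overwrite command-line argument (the flag name and parsing are illustrative, not taken from the original job):

import java.util.Arrays;

// Hypothetical flag parsing; `args` are the job's command-line arguments.
boolean overwrite = Arrays.asList(args).contains("--overwrite");
// Note the negation: when the user asks to overwrite, validation must be
// turned off (false) so saveAsTextFile() accepts an existing directory.
sparkConf.set("spark.hadoop.validateOutputSpecs", String.valueOf(!overwrite));

Keep in mind that disabling validation does not clean the output directory; files left over from earlier runs may still be present alongside the new output.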

Upvotes: 1
