Ajeet

Reputation: 725

Apache Spark in yarn-cluster mode is throwing Hadoop FileAlreadyExistsException

I am trying to execute my Spark job in yarn-cluster mode. It works fine in standalone and yarn-client mode, but in yarn-cluster mode it throws a FileAlreadyExistsException at pairs.saveAsTextFile(output);

Here is my implementation of the job:

SparkConf sparkConf = new SparkConf().setAppName("LIM Spark PolygonFilter").setMaster(master);
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
Broadcast<IGeometry> boundryBroadCaster = broadcastBoundry(javaSparkContext, boundaryPath);
JavaRDD<String> file = javaSparkContext.textFile(input); //.cache();
JavaRDD<String> pairs = file.filter(new FilterFunction(params, boundryBroadCaster));
pairs.saveAsTextFile(output); // throws FileAlreadyExistsException in yarn-cluster mode

According to the logs, it works on one node and after that it starts throwing this exception on all of the remaining nodes.

Can someone please help me fix it? Thanks.

Upvotes: 4

Views: 416

Answers (1)

Ajeet

Reputation: 725

It works after disabling output-spec validation (spark.hadoop.validateOutputSpecs=false).
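For reference, a minimal sketch of applying that setting programmatically on the question's SparkConf (the same property can also be passed on the command line via spark-submit --conf spark.hadoop.validateOutputSpecs=false):

SparkConf sparkConf = new SparkConf()
        .setAppName("LIM Spark PolygonFilter")
        .setMaster(master)
        // Skip Hadoop's check that the output directory does not already exist.
        .set("spark.hadoop.validateOutputSpecs", "false");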

This looks like a Hadoop feature that warns the user that the specified output directory already contains data, which could be lost if the same directory is reused for the next iteration of the job.

In my application I added an extra parameter for the job, --overwrite, and we use it like this:

spark.hadoop.validateOutputSpecs = !overwrite

If the user wants to overwrite the existing output, they can pass the "overwrite" flag as true.
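A sketch of that wiring, assuming the job reads a boolean from a hypothetical --overwrite command-line argument (the flag name and parsing are illustrative, not taken from the original job):

import java.util.Arrays;

// Hypothetical flag parsing; `args` are the job's command-line arguments.
boolean overwrite = Arrays.asList(args).contains("--overwrite");
// Note the negation: when the user asks to overwrite, validation must be
// turned off (false) so saveAsTextFile() accepts an existing directory.
sparkConf.set("spark.hadoop.validateOutputSpecs", String.valueOf(!overwrite));

Keep in mind that disabling validation does not clean the output directory; files left over from earlier runs may still be present alongside the new output.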

Upvotes: 1
