Reputation: 725
I am trying to run my Spark job in yarn-cluster mode. It works fine in standalone and yarn-client mode, but in cluster mode it throws a FileAlreadyExistsException at
at pairs.saveAsTextFile(output);
Here is my implementation of the job:
SparkConf sparkConf = new SparkConf().setAppName("LIM Spark PolygonFilter").setMaster(master);
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
Broadcast<IGeometry> boundryBroadCaster = broadcastBoundry(javaSparkContext, boundaryPath);
JavaRDD<String> file = javaSparkContext.textFile(input);//.cache();
JavaRDD<String> pairs = file.filter(new FilterFunction(params , boundryBroadCaster));
pairs.saveAsTextFile(output);
As per the logs, it works for one node and after that it starts throwing this exception for all of the remaining nodes.
Can someone please help me fix it? Thanks.
Upvotes: 4
Views: 416
Reputation: 725
It works after disabling output-spec validation, i.e. setting spark.hadoop.validateOutputSpecs=false.
This check looks like a Hadoop feature that warns the user that the specified output directory already contains data, and that this data will be lost if the same directory is reused for the next iteration of the job.
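One way to disable the check is to pass the property at submit time (a sketch; the jar name and main class are placeholders, not my actual job):

```shell
spark-submit \
  --master yarn-cluster \
  --class com.example.MyJob \
  --conf spark.hadoop.validateOutputSpecs=false \
  my-job.jar /path/to/input /path/to/output
```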
In my application I added an extra job parameter, -overwrite, and we use it like this:
spark.hadoop.validateOutputSpecs = !overwrite
If the user wants to overwrite the existing output, they can pass the "overwrite" flag as true, which disables the validation.
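A minimal sketch of how such a flag could be wired up (the class name OverwriteFlag and the argument parsing are assumptions, not my original code; in the real job the returned value would be passed to sparkConf.set("spark.hadoop.validateOutputSpecs", ...) before the JavaSparkContext is created):

```java
import java.util.Arrays;

public class OverwriteFlag {

    // Derive the value for spark.hadoop.validateOutputSpecs from the
    // command-line arguments. Note the inversion: when the user asks to
    // overwrite, output-spec validation must be turned OFF ("false").
    static String validateOutputSpecsValue(String[] args) {
        boolean overwrite = Arrays.asList(args).contains("-overwrite");
        return String.valueOf(!overwrite);
    }

    public static void main(String[] args) {
        // With -overwrite, validation is disabled and the directory is reused.
        System.out.println(validateOutputSpecsValue(new String[]{"-overwrite"}));
        // Without it, Hadoop keeps protecting the existing output.
        System.out.println(validateOutputSpecsValue(new String[]{"input.txt"}));
    }
}
```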
Upvotes: 1