Reputation: 26
I have ORC data on HDFS (non-partitioned), ~8 billion rows, 250 GB in size. I am reading the data into a DataFrame and writing it back without any transformations, using partitionBy, e.g.: df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
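For clarity, here is a minimal sketch of what the job does (the paths, app name, and column name are placeholders, not my real ones):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-partitioned-write")
      .getOrCreate()

    // Read the non-partitioned ORC data (~8 billion rows, ~250 GB)
    val df = spark.read.orc("hdfs:///path/to/input_orc")

    // Write it back partitioned by one column, with no transformations
    df.write
      .mode("overwrite")
      .partitionBy("some_column")
      .orc("hdfs:///path/to/output_orc")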
Monitoring the job status in the Spark UI, the job and its stage complete in 20 minutes, but the "SQL" tab in the Spark UI shows 40 minutes.
After running the job in debug mode and going through the Spark logs, I realised the tasks writing to "_temporary" complete in 20 minutes.
After that, the merge from "_temporary" to the actual output path takes another 20 minutes.
So my question is: does the driver process merge the data from "_temporary" to the output path sequentially, or is that done by the executor tasks?
Is there anything I can do to improve the performance?
Upvotes: 0
Views: 2828
Reputation: 9417
You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which has been known to create a bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on your version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
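As an illustration (app name is arbitrary, and whether v2 is available and safe for you depends on your Spark/Hadoop versions), the setting can be passed when building the session:

    import org.apache.spark.sql.SparkSession

    // Sketch only: FileOutputCommitter algorithm v2 moves task output to the
    // final location at task commit time, instead of the driver moving it all
    // at job commit.
    val spark = SparkSession.builder()
      .appName("orc-write-v2-committer")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()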
On a separate note, having 8 cores per executor is not recommended as it might saturate disk IO when all 8 tasks are writing output at once.
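If you want to experiment with that, something like the following caps the cores per executor (the numbers are purely an example; the right values depend on your cluster manager and hardware):

    // Example only: fewer concurrent writer tasks per executor,
    // scaling out with more executors instead of up with more cores.
    val spark = SparkSession.builder()
      .appName("orc-write-fewer-cores")
      .config("spark.executor.cores", "4")
      .config("spark.executor.instances", "16")
      .getOrCreate()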
Upvotes: 1