Chari

Reputation: 26

Spark write to HDFS is slow

I have ORC data on HDFS (non-partitioned), ~8 billion rows, ~250 GB in size. I am reading the data into a DataFrame and writing it back without any transformations, using partitionBy, e.g. df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
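For reference, the whole job is essentially just this sketch (app name, paths and the partition column are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-repartition-write").getOrCreate()

    // Read the non-partitioned ORC data (~8 billion rows, ~250 GB).
    val df = spark.read.orc("hdfs:///data/source_orc")

    // Write it back out, partitioned by a column, with no transformations.
    df.write
      .mode("overwrite")
      .partitionBy("some_column")
      .orc("hdfs:///data/target_orc")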

Monitoring the job status in the Spark UI, the job and its stage complete in 20 minutes, but the "SQL" tab in the Spark UI shows 40 minutes.

After running the job in debug mode and going through the Spark logs, I realised the tasks writing to "_temporary" complete in 20 minutes.

After that, moving the files from "_temporary" to the actual output path takes another 20 minutes.

So my question is: does the driver process merge the data from "_temporary" into the output path sequentially, or is that done by the executor tasks?

Is there anything I can do to improve the performance?

Upvotes: 0

Views: 2828

Answers (1)

mazaneicha

Reputation: 9417

You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which is a known bottleneck. But frankly, people usually observe this problem with a much larger number of files than in your case. Depending on your version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details. For example, you could set it when building the session, along the lines of the sketch below.
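A minimal sketch, assuming a Spark/Hadoop combination that supports algorithm v2 (the app name is a placeholder):

    import org.apache.spark.sql.SparkSession

    // With algorithm version 2, each task commits (renames) its own output files
    // directly into the destination at task commit, so the driver no longer has
    // to move every file one by one during job commit.
    val spark = SparkSession.builder()
      .appName("orc-write-v2-committer")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()

The same setting can also be passed on the command line via --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2.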

On a separate note, having 8 cores per executor is not recommended, as it can saturate disk I/O when all 8 tasks write output at once.
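If you control the executor sizing, a smaller executor could look something like the sketch below; the numbers are purely illustrative and need tuning for your cluster:

    import org.apache.spark.sql.SparkSession

    // Fewer cores per executor means fewer concurrent writers per node,
    // which eases pressure on local disks and on the HDFS DataNodes.
    // Illustrative numbers only - adjust to your cluster's capacity.
    val spark = SparkSession.builder()
      .appName("orc-write-smaller-executors")
      .config("spark.executor.cores", "4")
      .config("spark.executor.instances", "16")
      .config("spark.executor.memory", "12g")
      .getOrCreate()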

Upvotes: 1
