Reputation: 26
I have ORC data on HDFS (non-partitioned), ~8 billion rows, 250 GB in size. I am reading the data into a DataFrame and writing it back without any transformations, using partitionBy, e.g.: df.write.mode("overwrite").partitionBy("some_column").orc("hdfs path")
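For clarity, here is a minimal sketch of what the job does (the paths, app name, and column name are placeholders, not my real ones):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-partitioned-write")
      .getOrCreate()

    // Read the non-partitioned ORC data (~8 billion rows, ~250 GB)
    val df = spark.read.orc("hdfs:///path/to/input_orc")

    // Write it back partitioned by one column, with no transformations
    df.write
      .mode("overwrite")
      .partitionBy("some_column")
      .orc("hdfs:///path/to/output_orc")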
Monitoring the job status in the Spark UI, the job and its stage complete in 20 minutes, but the "SQL" tab in the Spark UI shows 40 minutes.
After running the job in debug mode and going through the Spark logs, I realised the tasks writing to "_temporary" complete in 20 minutes.
After that, the merge from "_temporary" to the actual output path takes another 20 minutes.
So my question is: does the driver process merge the data from "_temporary" to the output path sequentially, or is that done by the executor tasks?
Is there anything I can do to improve the performance?
Upvotes: 0
Views: 2828
Reputation: 9417
You may want to check the spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version option in your app's config. With version 1, the driver commits the temporary files sequentially, which has been known to create a bottleneck. But frankly, people usually observe this problem only with a much larger number of files than in your case. Depending on your version of Spark, you may be able to set the commit algorithm version to 2; see SPARK-20107 for details.
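As an illustration (app name is arbitrary, and whether v2 is available and safe for you depends on your Spark/Hadoop versions), the setting can be passed when building the session:

    import org.apache.spark.sql.SparkSession

    // Sketch only: FileOutputCommitter algorithm v2 moves task output to the
    // final location at task commit time, instead of the driver moving it all
    // at job commit.
    val spark = SparkSession.builder()
      .appName("orc-write-v2-committer")
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      .getOrCreate()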
On a separate note, having 8 cores per executor is not recommended as it might saturate disk IO when all 8 tasks are writing output at once.
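If you want to experiment with that, something like the following caps the cores per executor (the numbers are purely an example; the right values depend on your cluster manager and hardware):

    // Example only: fewer concurrent writer tasks per executor,
    // scaling out with more executors instead of up with more cores.
    val spark = SparkSession.builder()
      .appName("orc-write-fewer-cores")
      .config("spark.executor.cores", "4")
      .config("spark.executor.instances", "16")
      .getOrCreate()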
Upvotes: 1