Reputation: 761
When writing a file to HDFS using Spark, the write is quite fast when no partitioning is used. When I partition the output, however, the write latency increases by a factor of ~24.
For the same file, writing without partitioning takes around 600 ms. Writing partitioned by id (which generates exactly 1,000 partitions, since there are 1,000 ids in the file) takes around 14 seconds.
Has anyone else experienced that writing a partitioned file takes a very long time? What is the root cause of this — perhaps that Spark needs to create 1,000 folders and files, one per partition? Do you have an idea how this can be sped up?
val myRdd = streamedRdd.map { case ((id, metric, time), value) =>
  Record(id, metric, getEpoch(time), time, value)
}
val df = myRdd.toDF

df.write
  .mode(SaveMode.Append)
  .partitionBy("id")
  .parquet(path)
Upvotes: 3
Views: 6450
Reputation: 624
Spark executors write their data to HDFS in parallel, so the write cost depends on how your data is spread across the cluster after partitioning.
For many small chunks of data, the overhead of establishing connections from multiple executor nodes to HDFS and writing many small files dominates, compared to writing the entire file sequentially.
How to avoid this:
By default, Spark partitions the data using a hash partitioner (the key is hashed, and keys with the same hash go to the same node). Try specifying a range partitioner instead; please find the sample snippets below.
The following snippet uses the default hash partitioner:
yourRdd.groupByKey().saveAsTextFile("HDFS PATH")
The following snippet uses a custom range partitioner. It creates 8 partitions, as specified in RangePartitioner(8, yourRdd), and writing through 8 connections is a better choice than writing through 1,000 connections:
val tunedPartitioner = new RangePartitioner(8, yourRdd)
val partitioned = yourRdd.partitionBy(tunedPartitioner)
partitioned.saveAsTextFile("HDFS PATH")
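To see why fewer partitions means fewer output files, here is a minimal, self-contained sketch of the hash assignment described above: Spark's HashPartitioner maps each key to key.hashCode modulo the partition count, shifted to stay non-negative. The object and method names here are illustrative, not Spark API:

```scala
object HashPartitionDemo {
  // Mirrors the idea behind Spark's HashPartitioner: hash the key,
  // then take a non-negative modulo to pick a partition index.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // keep index in [0, numPartitions)
  }

  def main(args: Array[String]): Unit = {
    // 1,000 distinct ids collapse into at most 8 buckets, so only a
    // handful of files (and HDFS connections) are needed instead of 1,000.
    val ids = (1 to 1000).map(i => s"id-$i")
    val buckets = ids.map(partitionFor(_, 8)).toSet
    println(buckets.size) // at most 8
  }
}
```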
Again, this is a trade-off between the amount of data to write and the number of partitions you create.
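The same idea can be applied on the DataFrame side, as in the question's code. A hedged sketch, assuming the df and path from the question: clustering rows by the partition column first means each id's data is handled by a single task, so each partition directory receives one file instead of many.

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Repartition by the same column used in partitionBy, so every id's
// rows end up in one task and each id directory gets a single file.
df.repartition(col("id"))
  .write.mode(SaveMode.Append)
  .partitionBy("id")
  .parquet(path)
```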
Upvotes: 1