Reputation: 142

Spark: repartition output by key

I'm trying to output records using the following code:

spark.createDataFrame(asRow, struct)
      .write
      .partitionBy("foo", "bar")
      .format("text")
      .save("/some/output-path")

I don't have a problem when the data is small. However when I'm processing ~600GB input, I am writing around 290k files and that includes small files per partition. Is there a way we could control the number of output files per partition? Because right now I am writing a lot of small files and it's not good.

Upvotes: 1

Answers (2)

Arnon Rotem-Gal-Oz

Reputation: 25909

Having lots of files is the expected behavior as each partition (resulting in whatever computation you had before the write) will write to the partitions you requested the relevant files

If you wish to avoid that you need to repartition before the write:

spark.createDataFrame(asRow, struct)
      .repartition("foo","bar")
      .write
      .partitionBy("foo", "bar")
      .format("text")
      .save("/some/output-path")

Upvotes: 1

Vladislav Varslavans

Reputation: 2934

You have multiple files per partition because each node writes output to its own file. That means that the only way how to have only single file per partition is to re-partition data before writing. Please note, that that will be very inefficient because data repartition will cause shuffling on your data.

Upvotes: 0

Spark: repartition output by key

Answers (2)

Related Questions