Ivan Sharamet

Reputation: 317

Efficiently repartition an already partitioned dataset to combine small files into bigger ones

Is there a way to repartition an already partitioned dataset so as to reduce the number of files within a single partition efficiently, i.e. without shuffling? For example, if I have a dataset partitioned by some key:

key=1/
  part1
  ..
  partN
key=2/
  part1
  ..
  partN
..
key=M/
  part1
  ..
  partN

I can just do the following:

spark.read
  .parquet("/input")
  .repartition("key")
  .write
  .partitionBy("key")
  .parquet("/output")

I expect all data from a single partition to land on the same executor, but it seems to work differently and a lot of shuffling is involved. Am I doing something wrong here? The data is stored in Parquet and I'm using Spark 2.4.3.
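
For reference, the shuffle should show up in the physical plan as an Exchange (hashpartitioning) node. A minimal sketch to check this, assuming the same /input path and an existing SparkSession named spark:

import org.apache.spark.sql.functions.col

// Repartitioning by an expression plans an Exchange (hashpartitioning),
// i.e. a shuffle, which explain() prints in the physical plan.
spark.read
  .parquet("/input")
  .repartition(col("key"))
  .explain()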

Upvotes: 0

Views: 285

Answers (1)

Mi7flat5

Reputation: 89

You need to coalesce before the write.

import spark.implicits._ // needed for the $"..." column syntax

val n = 1 // number of desired part files
spark.read
  .parquet("/input")
  .repartition($"key") // requires a Column, not a String
  .coalesce(n)
  .write
  .partitionBy("key")
  .parquet("/output")

Upvotes: 1
