Ivan Sharamet

Reputation: 317

Efficiently repartition an already partitioned dataset to combine small files into bigger ones

Is there a way to repartition an already partitioned dataset so as to reduce the number of files within a single partition efficiently, i.e. without shuffling? For example, if I have a dataset partitioned by some key:

key=1/
  part1
  ..
  partN
key=2/
  part1
  ..
  partN
..
key=M/
  part1
  ..
  partN

I can just do the following:

spark.read
  .parquet("/input")
  .repartition("key")
  .write
  .partitionBy("key")
  .parquet("/output")

I expect all data from a single partition to land on the same executor, but it seems to work differently and a lot of shuffling is involved. Am I doing something wrong here? The data is stored in Parquet and I'm using Spark 2.4.3.
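
For reference, the shuffle should show up in the physical plan as an Exchange (hashpartitioning) node. A minimal sketch to check this, assuming the same /input path and an existing SparkSession named spark:

import org.apache.spark.sql.functions.col

// Repartitioning by an expression plans an Exchange (hashpartitioning),
// i.e. a shuffle, which explain() prints in the physical plan.
spark.read
  .parquet("/input")
  .repartition(col("key"))
  .explain()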

Upvotes: 0

Views: 285

Answers (1)

Mi7flat5

Reputation: 89

You need to coalesce before the write.

import spark.implicits._ // needed for the $"..." column syntax

val n = 1 // number of desired part files
spark.read
  .parquet("/input")
  .repartition($"key") // requires a Column, not a String
  .coalesce(n)
  .write
  .partitionBy("key")
  .parquet("/output")

Upvotes: 1
