Reputation: 317
Is there a way to repartition an already partitioned dataset so as to reduce the number of files within a single partition efficiently, i.e. without shuffling? For example, suppose I have a dataset partitioned by some key:
key=1/
  part1
  ..
  partN
key=2/
  part1
  ..
  partN
..
key=M/
  part1
  ..
  partN
I can just do the following:
spark.read
  .parquet("/input")
  .repartition("key")
  .write
  .partitionBy("key")
  .parquet("/output")
I expect all data from a single partition to land on the same executor, but it seems to work differently and a lot of shuffling is involved. Am I doing something wrong here? The data is stored in Parquet and I'm using Spark 2.4.3.
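For reference, a quick way to see the shuffle (assuming the same /input layout as above and import spark.implicits._ in scope) is to print the physical plan; repartition on a column shows up as an Exchange hashpartitioning(key, ...) node:

import spark.implicits._

spark.read
  .parquet("/input")
  .repartition($"key")
  .explain() // prints the physical plan, including the Exchange (shuffle) step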
Upvotes: 0
Views: 285
Reputation: 89
You need to coalesce before the write.
val n = 1 // desired number of part files

spark.read
  .parquet("/input")
  .repartition($"key") // repartition requires a Column, not a String
  .coalesce(n)
  .write
  .partitionBy("key")
  .parquet("/output")
Upvotes: 1