Reputation: 175
In our project, we are using repartition(1)
to write data into table, I am interested to know why coalesce(1)
cannot be used here because repartition
is a costly operation compared to coalesce
.
I know repartition
distributes data evenly across partitions, but when the output file is of single part file, why can't we use coalesce(1)
?
Upvotes: 7
Views: 15364
Reputation: 439
coalesce
has an issue where if you're calling it using a number smaller than your current number of executors, the number of executors used to process that step will be limited by the number you passed in to the coalesce function.
The repartition
function avoids this issue by shuffling the data. In any scenario where you're reducing the data down to a single partition (or really, less than half your number of executors), you should almost always use repartition
over coalesce because of this. The shuffle caused by repartition is a small price to pay compared to the single-threaded operation of a call to coalesce(1)
Upvotes: 12
Reputation: 18108
You state nothing else in terms of logic.
coalesce
will use existing partitions to minimize shuffling. In case of coalsece(1) and counterpart may be not a big deal, but one can take this guiding principle that repartition
creates new partitions and hence does a full shuffle. That said, coalsece can be said to minimize the amount of shuffling.
In my spare time I chanced upon this https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908 excellent article. Look for the quote: Coalesce sounds useful in some cases, but has some problems.
Upvotes: 4