nagraj036

Reputation: 175

Difference between repartition(1) and coalesce(1)

In our project, we use repartition(1) when writing data into a table. I would like to know why coalesce(1) cannot be used here instead, since repartition is a costly operation compared to coalesce.

I know repartition distributes data evenly across partitions, but when the output is a single part file, why can't we use coalesce(1)?

Upvotes: 7

Views: 15364

Answers (2)

Nolan Barth

Reputation: 439

coalesce has an issue: if you call it with a number smaller than your current number of executors, the number of executors used to process that step is limited to the number you passed to coalesce. Because coalesce does not shuffle, the upstream work in the same stage runs with that reduced parallelism as well.

The repartition function avoids this issue by shuffling the data: the upstream work still runs in parallel, and only the final step handles a single partition. Whenever you reduce the data down to a single partition (or really, to fewer than half your number of executors), you should almost always use repartition over coalesce for this reason. The shuffle caused by repartition is a small price to pay compared to the single-threaded processing that coalesce(1) forces.
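
A minimal PySpark-style sketch of the two options, for illustration only. The helper names `to_single_partition` and `write_single_file`, the `shuffle` flag, and the Parquet output are assumptions that do not come from the question; `df` stands for whatever DataFrame you already have.

```python
def to_single_partition(df, shuffle=True):
    """Collapse a DataFrame to one partition.

    shuffle=True  -> repartition(1): adds a shuffle, so the upstream work
                     keeps its existing parallelism before rows are gathered.
    shuffle=False -> coalesce(1): no shuffle, so the upstream work in the
                     same stage is also squeezed into a single task.
    """
    return df.repartition(1) if shuffle else df.coalesce(1)


def write_single_file(df, path, shuffle=True):
    # `path` and the Parquet format are hypothetical choices for this sketch;
    # use whatever writer your table actually needs.
    to_single_partition(df, shuffle).write.mode("overwrite").parquet(path)
```

If you call `explain()` on both results, you should see an exchange (shuffle) step only in the repartition(1) plan, which is a quick way to check which behaviour you are getting.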

Upvotes: 12

Ged

Reputation: 18108

You state nothing else in terms of logic.

Upvotes: 4
