Praveen Gr
Praveen Gr

Reputation: 197

Will Spark Coalesce perform Shuffle

Assume I have a 8 node Spark cluster with 8 partitions (i.e each node has 1 partitions) Now if I try to reduce the number of partitions to 4 by using coalesce(4), 1. Will coalesce perform shuffle ? 2. If yes, then in which nodes will the newly created 4 partitions reside ?

Upvotes: 0

Views: 1893

Answers (3)

Mohd Avais
Mohd Avais

Reputation: 247

Coalesce by default has shuffling flag set to False.

If you have to increase the partitions, you can either use coalesce with shuffle flag set to true(with false, partition remains unchanged) or use repartition

If you are decreasing partitions, better to use coalesce with flag set to False as it avoids full shuffle unlike repartition where shuffling is guaranteed. Coalesce with false shuffling moves data on 1 partition to another existing partition thereby avoiding full shuffle giving better performance. say, data from partitions 5,6,7,8 will be moved to existing partitions 1,2,3,4 rather than shuffling the data of all 8 partitions

Determining on which node data resides is decided by the partitioner you are using

Upvotes: 3

Naresh
Naresh

Reputation: 60

coalesce(numpartitions) - used to reduce the no of partitions without shuffling coalesce(numpartitions,shuffle=false) - spark won't perform any shuffling because of shuffle = false option and used to reduce the no of partitions coalesce(numpartitions,shuffle=true) - spark will perform shuffling because of shuffle = true option and used to reduce and increase the partitions

Example : Assume rdd with 8 partitions initially

rdd.coalesce(4) - will results 4 partitons as output rdd.coalesce(4,false) - will results 4 partitons as output rdd.coalesce(10,false) - will results 8 partitons as output (shuffle = false will be able to reduce the partitons but not able to increase) rdd.coalesce(4,true) - will results 4 partitons as output rdd.coalesce(10,true) - will results 10 partitons as output (shuffle = true will be able to able to increase partitons)

Upvotes: 0

Harjeet Kumar
Harjeet Kumar

Reputation: 524

If You Check Spark API documentation of Coalesce. Then that is following

coalesce(int numPartitions, boolean shuffle, scala.math.Ordering<T> ord)

by default the shuffle Flag is False. Repartition calls same method by changing shuffle flag to True. With This info, Now let us answer your question

To Change Number of Partitions From 8 to 4, Shuffle has to happen. But here you are explicitly saying No to shuffle. So Number of Partitions in This case will not change.

even If you Try to Increase number of Partitions, It will not change. Since shuffle flag is False. Hope This Helps

Cheers!

Upvotes: 4

Related Questions