Reputation: 197
Assume I have a 8 node Spark cluster with 8 partitions (i.e each node has 1 partitions) Now if I try to reduce the number of partitions to 4 by using coalesce(4), 1. Will coalesce perform shuffle ? 2. If yes, then in which nodes will the newly created 4 partitions reside ?
Upvotes: 0
Views: 1893
Reputation: 247
Coalesce by default has shuffling flag set to False.
If you have to increase the partitions, you can either use coalesce with shuffle flag set to true(with false, partition remains unchanged) or use repartition
If you are decreasing partitions, better to use coalesce with flag set to False as it avoids full shuffle unlike repartition where shuffling is guaranteed. Coalesce with false shuffling moves data on 1 partition to another existing partition thereby avoiding full shuffle giving better performance. say, data from partitions 5,6,7,8 will be moved to existing partitions 1,2,3,4 rather than shuffling the data of all 8 partitions
Determining on which node data resides is decided by the partitioner you are using
Upvotes: 3
Reputation: 60
coalesce(numpartitions) - used to reduce the no of partitions without shuffling coalesce(numpartitions,shuffle=false) - spark won't perform any shuffling because of shuffle = false option and used to reduce the no of partitions coalesce(numpartitions,shuffle=true) - spark will perform shuffling because of shuffle = true option and used to reduce and increase the partitions
Example : Assume rdd with 8 partitions initially
rdd.coalesce(4) - will results 4 partitons as output rdd.coalesce(4,false) - will results 4 partitons as output rdd.coalesce(10,false) - will results 8 partitons as output (shuffle = false will be able to reduce the partitons but not able to increase) rdd.coalesce(4,true) - will results 4 partitons as output rdd.coalesce(10,true) - will results 10 partitons as output (shuffle = true will be able to able to increase partitons)
Upvotes: 0
Reputation: 524
If You Check Spark API documentation of Coalesce. Then that is following
coalesce(int numPartitions, boolean shuffle, scala.math.Ordering<T> ord)
by default the shuffle Flag is False. Repartition calls same method by changing shuffle flag to True. With This info, Now let us answer your question
To Change Number of Partitions From 8 to 4, Shuffle has to happen. But here you are explicitly saying No to shuffle. So Number of Partitions in This case will not change.
even If you Try to Increase number of Partitions, It will not change. Since shuffle flag is False. Hope This Helps
Cheers!
Upvotes: 4