Reputation: 2949
groupByKey suffers from shuffling the data, and its functionality can also be achieved with combineByKey or reduceByKey. So when should this API be used? Is there any use case for it?
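For context, a rough sketch of the equivalence the question mentions. The data and variable names below are made up, and it assumes a spark-shell where sc is the SparkContext:

    // Hypothetical data; names are illustrative only
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    val viaGroupByKey = pairs.groupByKey()                    // RDD[(String, Iterable[Int])]

    // Same grouping via combineByKey: build a per-key buffer explicitly
    val viaCombineByKey = pairs.combineByKey(
      (v: Int) => List(v),                                    // create the buffer from the first value
      (buf: List[Int], v: Int) => v :: buf,                   // add a value within a partition
      (b1: List[Int], b2: List[Int]) => b1 ::: b2             // merge buffers across partitions
    )

    // Same grouping via reduceByKey on pre-wrapped lists (the pattern the answers below discuss)
    val viaReduceByKey = pairs.mapValues(List(_)).reduceByKey(_ ++ _)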
Upvotes: 4
Views: 426
Reputation: 2949
Avoid groupByKey when the values for a key will be reduced to a single value, e.g. when you need a sum per key.

Use groupByKey when you know the merged values will not be reduced to a single value. Do not emulate it by collecting the values into a List with reduce(_ ++ _) - avoid that pattern.

The reason is that reducing lists allocates memory on both the map side and the reduce side, and the buffers built on executors that don't own the key are wasted work during the shuffle. A good example is TopN; see the sketch below.
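To make that concrete, a rough TopN sketch under those assumptions (the data and names are illustrative, not from the original answer; sc is a spark-shell SparkContext):

    val scores = sc.parallelize(Seq(("a", 5), ("a", 9), ("a", 1), ("b", 7)))
    val n = 2

    // Avoid: reduceByKey that only concatenates lists. Full per-key buffers are built on
    // executors that don't own the key, and all of it still has to be shuffled.
    val topNViaReduce = scores
      .mapValues(List(_))
      .reduceByKey(_ ++ _)
      .mapValues(_.sorted(Ordering[Int].reverse).take(n))

    // When every value per key really is needed, groupByKey states that intent directly
    // and skips the pointless map-side combine.
    val topNViaGroup = scores
      .groupByKey()
      .mapValues(_.toList.sorted(Ordering[Int].reverse).take(n))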
More on this - https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
Upvotes: 1
Reputation: 1257
I would say that if groupByKey is the last transformation in your chain of work (or anything you do after it has only narrow dependencies), then you may consider it.
The reasons reduceByKey is preferred are: 1. the combine step, as Alister mentioned, and 2. reduceByKey also partitions the output by key and does part of the sum/aggregation within each partition, i.e. as a narrow operation.
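A rough sketch of both cases; the data and names are made up, and sc is assumed to be a spark-shell SparkContext:

    val clicks = sc.parallelize(Seq(("u1", "home"), ("u1", "cart"), ("u2", "home")))

    // groupByKey as the last wide step: one shuffle, then only narrow work afterwards
    val pagesPerUser = clicks
      .groupByKey()
      .mapValues(_.toList.sorted)          // narrow: runs inside each result partition

    // reduceByKey preferred when values collapse to one result per key: it pre-aggregates
    // within each partition before the shuffle, and the output is partitioned by key
    val clicksPerUser = clicks.mapValues(_ => 1).reduceByKey(_ + _)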
Upvotes: 0
Reputation: 2455
Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.
Consider if you had to sum a numeric attribute by group in an RDD[(group, num)]. groupByKey will give you RDD[(group, List[num])], which you can then manually reduce using map. The shuffle would need to move all the individual nums to the destination partitions/nodes to build that list - many rows being shuffled.

Because reduceByKey knows what you are doing with the nums (i.e. summing them), it can sum each individual partition before the shuffle - so you'd have at most one row per group, per partition, being written out to the shuffle partition/node.
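A minimal sketch of that comparison; the data and names are illustrative, with sc assumed from a spark-shell:

    val nums = sc.parallelize(Seq(("g1", 1), ("g1", 2), ("g2", 5)))

    // groupByKey: every individual num is shuffled, then summed after the shuffle
    val sumViaGroup = nums.groupByKey().mapValues(_.sum)

    // reduceByKey: each partition pre-sums its own values per key, so at most one row
    // per group per partition is written to the shuffle
    val sumViaReduce = nums.reduceByKey(_ + _)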
Upvotes: 3