Knight71

Reputation: 2949

When should the groupByKey API be used in Spark programming?

groupByKey suffers from shuffling the data, and its functionality can also be achieved with combineByKey or reduceByKey. So when should this API be used? Is there any use case?

Upvotes: 4

Views: 426

Answers (4)

Knight71

Reputation: 2949

Avoid groupByKey when the values for a key will be reduced to a single value, e.g. computing a sum per key.

Use groupByKey when you know the values for a key are not going to be reduced to a single value. For example, avoid reduceByKey with list concatenation (reduce(_ ++ _)).

The reason is that reducing into a list allocates memory on both the map side and the reduce side, and the lists built on executors that don't own a given key are wasted during the shuffle. A good example is TopN.
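A minimal pure-Python sketch of this trade-off (no Spark; the partition layout and helper names are assumptions, chosen only to illustrate the memory pattern): a TopN computed by grouping builds each key's list once at its owner, while a list-concatenation reduce buffers partial lists in every partition first, only to re-copy them.

```python
from collections import defaultdict
import heapq

# Simulated input: (key, value) pairs spread across two partitions.
partitions = [
    [("a", 5), ("b", 3), ("a", 9)],
    [("a", 1), ("b", 7), ("b", 2)],
]

def top_n_with_grouping(parts, n):
    """groupByKey-style: ship raw rows to the key's owner, then take top N."""
    grouped = defaultdict(list)
    for part in parts:
        for k, v in part:
            grouped[k].append(v)   # one list per key, built once at the owner
    return {k: heapq.nlargest(n, vs) for k, vs in grouped.items()}

def top_n_with_list_reduce(parts, n):
    """reduce(_ ++ _)-style: build partial lists in EVERY partition first."""
    partials = []
    for part in parts:
        local = defaultdict(list)
        for k, v in part:
            local[k].append(v)     # same rows, buffered per partition as well
        partials.append(local)
    merged = defaultdict(list)
    for local in partials:
        for k, vs in local.items():
            merged[k].extend(vs)   # concatenation just re-copies the lists
    return {k: heapq.nlargest(n, vs) for k, vs in merged.items()}

print(top_n_with_grouping(partitions, 2))     # {'a': [9, 5], 'b': [7, 3]}
print(top_n_with_list_reduce(partitions, 2))  # same answer, extra copies
```

Both paths shuffle every raw row, but the list-reduce path also materialises per-partition lists that buy nothing, which is why a combine step only pays off when it actually shrinks the data.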

More on this - https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey

Upvotes: 1

ayan guha

Reputation: 1257

I would say that if groupByKey is the last transformation in your chain of work (or everything you do after it has only narrow dependencies), then you may consider it.

The reasons reduceByKey is preferred are: 1. the map-side combine, as Alister mentioned above; 2. reduceByKey also partitions the data by key, so the sum/aggregation becomes narrow, i.e. it can happen within partitions.
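A plain-Python sketch of point 2 (no Spark; the partition count is an assumption): after a hash shuffle by key, every row for a given key lands in the same partition, so a subsequent per-key step can run partition-by-partition with no further shuffle.

```python
# Hash-partition rows by key, as the shuffle of reduceByKey would.
NUM_PARTITIONS = 4
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

partitions = [[] for _ in range(NUM_PARTITIONS)]
for k, v in rows:
    partitions[hash(k) % NUM_PARTITIONS].append((k, v))

# Each key lives wholly in one partition, so a later per-key aggregation
# is "narrow": it sees all of a key's rows without crossing partitions.
for key in {k for k, _ in rows}:
    owners = {i for i, p in enumerate(partitions)
              if any(pk == key for pk, _ in p)}
    assert len(owners) == 1, f"key {key} is split across partitions"
```

This is the co-partitioning guarantee that lets Spark skip a shuffle for chained aggregations on the same key.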

Upvotes: 0

Alister Lee

Reputation: 2455

Combine and reduce will also eventually shuffle, but they have better memory and speed performance characteristics because they are able to do more work to reduce the volume of data before the shuffle.

Consider if you had to sum a numeric attribute by key in an RDD[(group, num)]. groupByKey will give you an RDD[(group, List[num])], which you can then reduce manually using map. The shuffle would need to move all the individual nums to the destination partitions/nodes to build that list, so many rows are being shuffled.

Because reduceByKey knows what you are doing with the nums (i.e. summing them), it can sum within each partition before the shuffle, so you'd have at most one row per group per partition being written out to the shuffle.
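The row counts above can be made concrete with a small simulation (plain Python, no Spark; the two-partition layout is an assumption):

```python
from collections import defaultdict

# Two simulated map-side partitions of (group, num) rows.
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

# groupByKey: every raw row crosses the shuffle boundary.
rows_shuffled_group = sum(len(p) for p in partitions)  # 6 rows

# reduceByKey(+): each partition pre-sums per key first, so at most
# one row per key per partition reaches the shuffle.
shuffled = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v              # map-side combine
    shuffled.extend(local.items())
rows_shuffled_reduce = len(shuffled)  # 4 rows (2 keys x 2 partitions)

# The reduce-side merge yields the same totals either way.
totals = defaultdict(int)
for k, v in shuffled:
    totals[k] += v
print(rows_shuffled_group, rows_shuffled_reduce, dict(totals))
# 6 4 {'a': 7, 'b': 14}
```

With many rows per key per partition, the gap between the two shuffle volumes grows accordingly, which is the whole argument for preferring reduceByKey for aggregations.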

Upvotes: 3

ChromeHearts

Reputation: 1771

According to the link below, groupByKey should be avoided.

Avoid GroupByKey

Upvotes: 1
