Reputation: 40508
There is some scary language in the docs of groupByKey, warning that it can be "very expensive" and suggesting to use aggregateByKey instead whenever possible.
I am wondering whether the difference in cost comes from the fact that, for some aggregations, the entire group never needs to be collected and loaded onto the same node, or if there are other differences in implementation.
Basically, the question is whether rdd.groupByKey() would be equivalent to rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _), or if it would still be more expensive.
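For concreteness, here is a minimal sketch of the two variants I have in mind (assuming a pair RDD of (String, Int) and an existing SparkContext sc; I use List.empty[Int] instead of Nil so the types resolve):

    import org.apache.spark.rdd.RDD

    // Made-up (key, value) pairs for illustration
    val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Variant 1: collect all values of a key into one Iterable on one node
    val grouped: RDD[(String, Iterable[Int])] = pairs.groupByKey()

    // Variant 2: build the same per-key collection with aggregateByKey
    val aggregated: RDD[(String, List[Int])] =
      pairs.aggregateByKey(List.empty[Int])(_ :+ _, _ ++ _)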
Upvotes: 4
Views: 3376
Reputation: 2959
If you are reducing to a single element instead of a list (for example, a word count), then aggregateByKey performs better, because values are combined within each partition before the shuffle and far less data goes over the network, as explained in the link performance of group by vs aggregate by.
But in your case you are merging into a list. With aggregateByKey, each partition first reduces all the values for a key into a single list and only then sends that data through the shuffle. This creates one intermediate list per key per partition, and the memory usage for that will be high.
With groupByKey, the merge happens only at the one node responsible for the key, so only one list is created per key. If you are merging into a list, groupByKey is optimal in terms of memory.
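To make the word-count case concrete, here is a rough sketch (the data and names are made up, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.rdd.RDD

    // Hypothetical (word, 1) pairs
    val words: RDD[(String, Int)] = sc.parallelize(Seq(("spark", 1), ("rdd", 1), ("spark", 1)))

    // Reducing to a single element per key: partial sums are computed inside each
    // partition before the shuffle, so very little data crosses the network
    val counts: RDD[(String, Int)] = words.aggregateByKey(0)(_ + _, _ + _)

    // Merging into a list per key: each partition still materializes its own
    // intermediate list, and every element still has to reach the reducer
    val lists: RDD[(String, List[Int])] =
      words.aggregateByKey(List.empty[Int])(_ :+ _, _ ++ _)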
Also refer to: SO answer by zero323
I am not sure about your use case, but if you can limit the number of elements in the list in the end result, then aggregateByKey / combineByKey will certainly give a much better result than groupByKey. For example, if you only want the top 10 values for a given key, you can achieve this efficiently with combineByKey and proper merge and combiner functions, rather than with groupByKey followed by taking 10 from each group; see the sketch below.
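A rough sketch of that top-10 idea using aggregateByKey (the input RDD, the key/value types, and the limit of 10 are assumptions for illustration):

    import org.apache.spark.rdd.RDD

    // Hypothetical (key, score) pairs
    val scores: RDD[(String, Int)] = sc.parallelize(Seq(("a", 5), ("a", 9), ("b", 3)))

    val topN = 10

    // Add one value to a capped accumulator, keeping only the largest topN values
    def insertCapped(acc: List[Int], v: Int): List[Int] =
      (v :: acc).sorted(Ordering[Int].reverse).take(topN)

    // Merge two capped accumulators coming from different partitions
    def mergeCapped(a: List[Int], b: List[Int]): List[Int] =
      (a ++ b).sorted(Ordering[Int].reverse).take(topN)

    // Each partition ships at most topN values per key into the shuffle
    val top10PerKey: RDD[(String, List[Int])] =
      scores.aggregateByKey(List.empty[Int])(insertCapped, mergeCapped)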
Upvotes: 6
Reputation: 1410
Let me help illustrate why the groupByKey operation leads to much higher cost.
By the semantics of this operation, the reduce task has to gather all of the values associated with a single unique key.
In short, let us have a look at its signature:
def groupByKey(): RDD[(K, Iterable[V])]
Because the "groupby" operation, all values associated with this key partitioned on different nodes can not be pre-merged. Huge mount of data transfer through over the network, lead to high network io load.
But aggregateByKey is not the same with it. let me clarify the signature:
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
The Spark engine implements this operation as follows:
Within each partition the values are pre-merged (a map-side combine), which means a given reducer only needs to fetch the pre-merged intermediate results from the shuffle map tasks.
This makes the network I/O significantly lighter.
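As a rough illustration of the difference (a sketch with made-up names; sc is assumed to be an existing SparkContext):

    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // groupByKey: every (key, value) pair crosses the network
    // before the per-key sum can be computed
    val sumsViaGroup: RDD[(String, Int)] = pairs.groupByKey().mapValues(_.sum)

    // aggregateByKey: each partition pre-merges its values into one partial sum
    // per key, so only those partial sums are shuffled
    val sumsViaAggregate: RDD[(String, Int)] = pairs.aggregateByKey(0)(_ + _, _ + _)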
Upvotes: -1