Reputation: 462
I am a newbie to Spark, and I have observed that in some cases a window-function approach and a groupBy approach are alternatives to each other. I want to understand which one is better from a performance perspective, and why. Both approaches cause a re-shuffling of data, but under which scenarios is one more effective than the other?
Upvotes: 1
Views: 968
Reputation: 27383
From my understanding, groupBy is more performant because it uses partial aggregation. With groupBy, not all records are shuffled but only the partial aggregates (for example for avg, that would be a sum and a count per group).
On the other hand, a window function will always shuffle all of your records; the aggregation is done only afterwards, so it should be slower.
But in reality it is rarely a choice between groupBy and window functions: in most cases you need to combine the groupBy result with the original data via a join (which can be expensive unless you can use a broadcast join), and quite often the logic cannot be expressed with groupBy at all (running sum/average, lead/lag, etc.).
Unfortunately, there is very little (official) literature about topics like this...
Upvotes: 2