Hitesh

Reputation: 462

Window vs GroupBy Performance in Spark

I am a newbie to Spark and I have observed that in some cases a window function approach and a groupBy approach are alternatives to each other. I want to understand which one is better from a performance perspective, and why. Both approaches cause re-shuffling of data, but under which scenarios will one be more effective than the other?

Upvotes: 1

Views: 968

Answers (1)

Raphael Roth

Reputation: 27383

From my understanding, groupBy is more performant because it uses partial aggregation. With groupBy, not all records are shuffled, only the partial aggregates (for avg, for example, that would be the per-partition sum and count).
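Here is a minimal sketch of what I mean (the dataset, the path and the columns customer_id and amount are made up for illustration); the physical plan shows a partial HashAggregate before the exchange:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical input: a sales dataset with customer_id and amount columns.
val spark = SparkSession.builder.appName("groupby-vs-window").getOrCreate()
val df = spark.read.parquet("/path/to/sales")  // hypothetical path

// groupBy: each partition pre-aggregates (partial sum and count),
// so only one small row per group and partition goes over the network.
val grouped = df.groupBy("customer_id").agg(avg("amount").as("avg_amount"))
grouped.explain()  // HashAggregate (partial) -> Exchange -> HashAggregate (final)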

On the other hand, a window function will always shuffle your full records; the aggregation is only done afterwards, so it should be slower.
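The same average expressed as a window function, reusing the df from the sketch above, looks roughly like this; note that the whole rows are exchanged before the aggregate is evaluated:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Every full row is shuffled to the partition of its customer_id,
// and the aggregate is evaluated only after the exchange.
val w = Window.partitionBy("customer_id")
val windowed = df.withColumn("avg_amount", avg("amount").over(w))
windowed.explain()  // Exchange of the complete rows, then Window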

But in reality it is rarely a choice of groupBy vs window functions: in most cases you need to combine the groupBy result with the original data via a join (which can be expensive unless you can use a broadcast join), and often the logic cannot be expressed with groupBy at all (running sum/average, lead/lag, etc.), as in the sketch below.
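To illustrate both points (again with the hypothetical df and columns from above, plus a made-up order_date column): attaching the per-group average to every row with groupBy requires a join back, and a running total cannot be expressed with groupBy alone:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, sum}

// groupBy result joined back to the original rows (broadcast-able if small):
val avgPerCustomer = df.groupBy("customer_id").agg(avg("amount").as("avg_amount"))
val joinedBack = df.join(avgPerCustomer, Seq("customer_id"))

// Running total per customer, ordered by a hypothetical order_date column;
// this kind of logic needs a window function.
val runningTotal = df.withColumn(
  "running_total",
  sum("amount").over(
    Window.partitionBy("customer_id")
      .orderBy("order_date")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)
  )
)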

Unfortunately, there is very little (official) literature on topics like this...

Upvotes: 2
