Patel

Reputation: 139

I want to measure Spark's dataframe aggregation performance. Count or collect action?

I want to gather performance statistics for dataframe aggregation on Spark. I am calling the count() action after groupBy and measuring how long it takes.

df.groupBy('Student').sum('Marks').count()

However, I found that if I use collect() instead of count(), the run takes about 10 times longer. Why?

Also, which method should I use for benchmarking tests like the one above: count() or collect()?

Thanks.

Upvotes: 0

Views: 892

Answers (1)

Tim

Reputation: 3725

Spark dataframes use a query optimizer (called Catalyst) to speed up Spark pipelines. In this case, there are two possibilities for what's happening:

  1. collect() is simply more expensive than count(). It takes all of your data distributed across the cluster, serializes it, sends it over the network to the driver, and deserializes it there. count() only computes one number per task and sends that back, which is far smaller.

  2. Catalyst is actually just counting the number of unique "Student" values. The result of the sum is never used, so it never needs to be computed at all!

Upvotes: 3
