Fulco

Reputation: 284

Possible causes of performance difference between two very similar Spark Dataframes

I am working on improving the performance of some Spark operations for a recommendation engine I am prototyping. I have stumbled upon a significant performance difference between two DataFrames I am using. Below are the results of describe() on both.

df1 (fast, numPartitions = 4):

+-------+------------------+--------------------+
|summary|           item_id|          popularity|
+-------+------------------+--------------------+
|  count|            187824|              187824|
|   mean| 96693.34836868558|                 1.0|
| stddev|55558.023793621316|5.281958866780519...|
|    min|                 0|  0.9999999999999998|
|    max|            192806|                 1.0|
+-------+------------------+--------------------+

df2 (approx. 10x slower, numPartitions = ±170):

+-------+-----------------+-----------------+
|summary|          item_id|            count|
+-------+-----------------+-----------------+
|  count|           187824|           187824|
|   mean|96693.34836868558|28.70869537439305|
| stddev|55558.02379362146|21.21976457710462|
|    min|                0|                1|
|    max|           192806|              482|
+-------+-----------------+-----------------+

Both DataFrames are cached, are the same size in terms of rows (187824) and columns (2), and have an identical item_id column. The main difference is that frame 1 contains floats in the second column, whereas frame 2 contains integers.

It seems as if every operation on DataFrame 2 is much slower, ranging from a simple .describe().show() to a more elaborate .subtract().subtract().take(). In the latter case DataFrame 2 takes 18 seconds, as opposed to 2 seconds for frame 1 (almost 10 times slower!).

I don't know where to begin looking for an explanation of this difference. Any tips or nudges in the right direction are greatly appreciated.

UPDATE: as proposed by Viacheslav Rodionov, the number of partitions of the DataFrames seems to be the cause of the performance issues with df2.

Digging deeper, both DataFrames are the result of .groupBy().agg().sortBy() operations on the same original DataFrame. The .groupBy().agg() step yields 200 partitions, and .sortBy() then returns 4 and ±170 partitions respectively. Why could this be?
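For reference, this is roughly how the partition counts can be checked at each step (a simplified sketch: `df` stands for the original DataFrame, and the column names and aggregation are placeholders, not my exact code):

from pyspark.sql import functions as F

# Simplified sketch: `df` is the original DataFrame, the aggregation is a placeholder.
grouped = df.groupBy("item_id").agg(F.count("*").alias("count"))
print(grouped.rdd.getNumPartitions())   # 200 -- the spark.sql.shuffle.partitions default

df2 = grouped.sort("count").cache()
print(df2.rdd.getNumPartitions())       # ±170 here, whereas df1 ends up with 4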

Upvotes: 3

Views: 3539

Answers (1)

Viacheslav Rodionov

Reputation: 2345

I'd start by looking at df.rdd.getNumPartitions().

A smaller number of bigger partitions is almost always a good idea, as it allows data to be compressed better and lets Spark do more actual work rather than manipulating files.
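For example (a rough sketch, assuming the slow DataFrame is called df2 as in your question):

# Check how many partitions back the DataFrame, then shrink them.
print(df2.rdd.getNumPartitions())

# coalesce() merges existing partitions without a full shuffle;
# repartition(4) would shuffle, but produce evenly sized partitions.
df2 = df2.coalesce(4).cache()
df2.count()   # force materialization of the new cache
print(df2.rdd.getNumPartitions())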

The other thing to look at is what your data looks like. Is it appropriate for the task you're trying to do?

  • If it's ordered by a date field which you're using to apply a BETWEEN operation, it will be faster than working with unsorted data.
  • If you work with specific months or years, it makes sense to partition your data by them.
  • The same goes for IDs. If you work with certain IDs, put the same IDs 'closer' to each other by partitioning/sorting your dataset.

My rule of thumb when storing data: first partition by a few low-cardinality fields (mostly booleans and dates), then sort all the other fields with sortWithinPartitions in order of data importance. This way you will achieve the best compression rate (which means faster processing time) and better data locality (again, faster processing time). But as always it all depends on your use case; always think about how you work with your data and prepare it accordingly.
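As a sketch of that rule of thumb (the column names and output path here are just examples, adapt them to your own schema):

# Example layout: partition by a few low-cardinality fields, then sort the rest.
(df.repartition("year", "month")              # low-cardinality fields first
   .sortWithinPartitions("item_id", "date")   # then the fields you filter/join on most
   .write
   .partitionBy("year", "month")              # keep the same layout on disk
   .parquet("/path/to/output"))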

Upvotes: 4
