Reputation: 18838
We have two approaches to selecting and filtering data from a Spark DataFrame df. First:
df = df.filter("filter definition").select('col1', 'col2', 'col3')
and second:
df = df.select('col1', 'col2', 'col3').filter("filter definition")
Suppose we want to call the count action after that.
Which one is more performant if we can swap the order of filter and select in Spark (i.e., the filter definition uses only the selected columns and no others)? Why? Does swapping filter and select make any difference for other actions?
Upvotes: 7
Views: 9475
Reputation: 2605
Spark (version 1.6 and above) uses the Catalyst optimizer for queries, so a less performant query is transformed into an efficient one.
To confirm, you can call explain(True) on the DataFrame (explain(true) in Scala) to check its optimized plan, which is the same for both queries.
PS: Newer versions also introduce a cost-based optimizer.
Upvotes: 7
Reputation: 2487
Yes, you can notice a difference if you are dealing with a huge amount of data that has a huge number of columns.
df = df.filter("filter definition").select('col1', 'col2', 'col3')
This filters on the condition first and then selects the required columns.
df = df.select('col1', 'col2', 'col3').filter("filter definition")
This is the other way around: it selects the columns first and applies the filter next.
DIFFERENCE
It all depends on which columns the filter uses. If you are filtering on columns that you select, it is always better to use select followed by filter: the columns are selected before the filter runs, so the filter takes less time, and the saving grows with the amount of data. If you are filtering on some other columns, I would always recommend selecting the filter columns along with the columns you want and then applying the filter, as that is much faster than applying the filter to the entire DataFrame.
So always go with the following to save time on the transformation:
df = df.select('col1', 'col2', 'col3').filter("filter definition")
Upvotes: -2