OmG

Reputation: 18838

Is there a preferred order for select and filter in Spark?

There are two ways to select and filter data from a Spark DataFrame df. First:

df = df.filter("filter definition").select('col1', 'col2', 'col3')

and second:

df = df.select('col1', 'col2', 'col3').filter("filter definition")

Suppose we want to call the count action after that. Which one performs better when filter and select can be swapped (i.e., the filter definition uses only the selected columns and no others)? Why? Does the effect of swapping filter and select differ between actions?

Upvotes: 7

Views: 9475

Answers (2)

Anurag Sharma

Reputation: 2605

Spark (version 1.6 and above) uses the Catalyst optimiser for queries, so the less performant query will be transformed into the efficient one.

To confirm this, you can call explain(true) on the DataFrame (explain(True) in PySpark) and check its optimised plan, which is the same for both queries.

Query 1 plan: (image not preserved)

Query 2 plan: (image not preserved)

PS: Newer versions also introduce a cost-based optimiser.

Upvotes: 7

Spark

Reputation: 2487

Yes, you can notice a difference if you are dealing with a huge amount of data that has a huge number of columns.

df = df.filter("filter definition").select('col1', 'col2', 'col3')

This filters on the condition first and then selects the required columns.

df = df.select('col1', 'col2', 'col3').filter("filter definition")

This is the other way around: it selects the columns first and applies the filter next.

DIFFERENCE

It depends on which columns the filter uses. If you are filtering on columns that you also select, it is better to use select followed by filter: the columns are selected before the filter runs, so the filter has less data to process and takes less time, which matters more as the data grows. If you are applying the filter on other columns, I would recommend selecting the filter columns along with the columns you want and then applying the filter, as this is much faster than applying the filter on the entire DataFrame.

So always go with the form below to save time on the transformation.

df = df.select('col1', 'col2', 'col3').filter("filter definition")

Upvotes: -2
