Is there any preference on the order of select and filter in spark?

Question

We have two approaches to selecting and filtering data from a Spark DataFrame df. First:

df = df.filter("filter definition").select('col1', 'col2', 'col3')

and second:

df = df.select('col1', 'col2', 'col3').filter("filter definition")

Suppose we then call the count action. Which approach is more performant, given that filter and select can be swapped (i.e., the filter definition uses only the selected columns and nothing more)? Why? Does swapping filter and select make any difference for other actions?

Anurag Sharma · Accepted Answer

Spark (version 1.6 and above) uses the Catalyst optimiser for queries, so the less performant query is transformed into the efficient one.

To confirm, you can call explain(true) on the DataFrame (explain(True) in PySpark) to check its optimised plan, which is the same for both queries.

Query1 plan:

Query2 plan:

PS: Newer versions also introduce a cost-based optimiser.
