user43107

Reputation: 375

Spark: is it better to break queries down into several dfs, or do it all at once?

What is best practice in Spark: one larger query, e.g.:

results = spark.sql("""
  select t1.col1, t2.col2
  from table1 t1
  left join table2 t2
    on t1.id = t2.id
  where
    t1.value1 < 50
  and
    t2.value2 > 100
""")
results.show()

or breaking this down into step-by-step dfs, e.g.:

from pyspark.sql.functions import col

df1 = spark.sql("""
  select col1, id, value1
  from table1
""")

df2 = spark.sql("""
  select col2, id, value2
  from table2
""")

df_joined = df1.join(df2, on='id', how='left')
df_where1 = df_joined.filter(col('value1') < 50)
results = df_where1.filter(col('value2') > 100)
results.show()

The idea is that the two approaches arrive at the same place; #1 does it in one query, while #2 stores each intermediate step in a new df.

The latter way is good for debugging, as you can easily check the data at each step to see where problems might lie.

But is it less efficient? Does the second way force Spark to process things in a way it might not choose if it had free rein? For example, does it send each step to the executors one at a time, rather than distributing parts and doing them in parallel?

Or, does it not matter, because each intermediate step is lazy? Does Spark combine each step when deciding how to do things anyway?
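For what it's worth, the plan for the step-by-step version can be inspected before any action runs:

# 'results' as built in the step-by-step snippet above; nothing has executed yet.
# explain(True) prints the parsed, analysed, optimised and physical plans.
results.explain(True)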

Upvotes: 0

Views: 167

Answers (1)

Raphael Roth

Reputation: 27383

As long as you don't do anything with the intermediate dataframes (e.g. calling show, etc.), the query plan (and thus performance) will be exactly the same.
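A minimal sketch to check this yourself, assuming table1 and table2 are registered as in the question (the trailing select only aligns the step-by-step output columns with the SQL version):

from pyspark.sql.functions import col

# Single-query version
sql_version = spark.sql("""
  select t1.col1, t2.col2
  from table1 t1
  left join table2 t2
    on t1.id = t2.id
  where t1.value1 < 50
    and t2.value2 > 100
""")

# Step-by-step version
df1 = spark.sql("select col1, id, value1 from table1")
df2 = spark.sql("select col2, id, value2 from table2")
step_version = (
    df1.join(df2, on='id', how='left')
       .filter(col('value1') < 50)
       .filter(col('value2') > 100)
       .select('col1', 'col2')   # align columns with the SQL version
)

# Compare the plans; no job runs until an action is called
sql_version.explain(True)
step_version.explain(True)

Since all of the intermediate steps are transformations, nothing executes until an action, and the two explain() outputs should show essentially the same optimised and physical plans.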

Upvotes: 1
