Reputation: 2187
I have a SQL query Like below -
A left JOIN B Left Join C Left JOIN D
Say table A is a big table whereas tables B, C, D are small.
Will Spark join will execute like- A with B and subsequent results will be joined with C then D
or,
Spark will automatically optimize i.e it will join B, C and D and then results will be joined with A.
My question is what is order of execution or join evaluation? Does it go left to right or right to left?
Upvotes: 2
Views: 1819
Reputation: 14871
Spark can optimize join order, if it has access to information about cardinialities of those joins.
For example, if those are parquet tables or cached dataframes, then it has estimates on total counts of the tables, and can reorder join order to make it less expensive. If a "table" is a jdbc dataframe, Spark may not have information on row counts.
Spark Query Optimizer can also choose a different join type in case if it has statistics (e.g. it can broadcast all smaller tables, and run broadcast hash join
instead of sort merge join
).
If statistics aren't available, then it'll will just follow the order as in the SQL query, e.g. from left to right.
Update:
I originally missed that all the joins in your query are OUTER joins (left
is equivalent to left outer
).
Normally outer joins can't be reordered, because this would change result of the query. I said "normally" because sometimes Spark Optimizer can convert an outer join to an inner join (e.g. if you have a WHERE
clause that filters out NULLs - see conversion logic here).
For completeness of the answer, reordering of joins is driven by two different codepaths, depending is Spark CBO is enabled or not (spark.sql.cbo.enabled
first appeared in Spark 2.2 and is off by default). If spark.sql.cbo.enabled
=true and spark.sql.cbo.joinReorder.enabled
=true (also off by default), and statistics are available/collected manually through ANALYZE TABLE .. COMPUTE STATISTICS
then reordering is based on estimated cardinality of the join I mentioned above.
Proof that reordering only works for INNER JOINS is here (on example of CBO).
Update 2: Sample queries that show that reordering of outer joins produce different results, so outer joins are never reordered :
Upvotes: 3
Reputation: 1269493
The order of interpretation of join
s does not matter for inner joins. However, it can matter for outer joins.
Your logic is equivalent to:
FROM ((A LEFT JOIN
B
) ON . . . LEFT JOIN
C
ON . . . LEFT JOIN
)
D
ON . . .
The simplest way to think about chains of LEFT JOIN
is that they keep all rows in the first table and columns from matching rows in the subsequent tables.
Note that this is the interpretation of the code. The SQL optimizer is free to rearrange the JOIN
s in any order to arrive at the same result set (although with outer joins this is generally less likely than with inner joins).
Upvotes: 1