Reputation: 687
I am testing a in-house relational DB and I cannot figure out the why the optimizer could figure out this query plan.
SELECT * from A LEFT JOIN B on A.x = B.x INNER JOIN C on B.y = C.y
The execution plan reported by this database could be written like below psuedo-code:
For each tupleB in B
For each tupleC in C
INNER JOIN tupleB and tupleC
For each tupleA in A
INNER JOIN tupleA
The result for this plan is correct.
If all the JOINs are INNER joins, this plan looks reasonable to me since INNER JOIN is both commutative and associative.
However, when the LEFT JOIN
and INNER JOIN
are mixed, how does the optimizer could figure out INNER_JOIN(INNER_JOIN(B,C),A)
share the same result with INNER_JOIN(LEFT_JOIN(A,B),C)
?
Is there a theory to prove this or this just happens case by case?
Upvotes: 0
Views: 819
Reputation: 10701
The equivalence of plans INNER_JOIN(LEFT_JOIN(A,B),C)
and INNER_JOIN(INNER_JOIN(B,C),A)
has two steps:
INNER_JOIN(LEFT_JOIN(A,B),C)
is equivalent to INNER_JOIN(INNER_JOIN(A,B),C)
INNER_JOIN(INNER_JOIN(A,B),C)
is equivalent to INNER_JOIN(INNER_JOIN(B,C),A)
The first equivalence is the more difficult to see. Once you perform LEFT_JOIN(A,B)
you may have rows from A
not having a counterpart in B
. These rows are the only rows that are not in INNER_JOIN(A,B)
. These rows will have NULL values in B
attributes. Subsequently, you perform an inner join with C
using B.y
and those extra rows have to disappear in the final result since the B.y
is NULL and the join condition B.y = C.y
is always evaluated to uknown
. Therefore, in the final result you have only those rows which are product of INNER_JOIN(A,B)
even if you process the LEFT_JOIN(A,B)
.
The second equivalence is commes from the join associativity
Upvotes: 1