RedSoft
RedSoft

Reputation: 371

INNER JOIN with complex condition dramatically increases the execution time

I have 2 tables with several identical fields needed to be linked in JOIN condition. E.g. in each table there are fields: P1, P2. I want to write the following join query:

SELECT ... FROM Table1
   INNER JOIN
   Table2
      ON    Table1.P1 = Table2.P1
         OR Table1.P2 = Table2.P2
         OR Table1.P1 = Table2.P2
         OR Table1.P2 = Table2.P1

In the case I have huge tables this request is executing a lot of time.

I tried to test how long will be the request of a query with one condition only. First, I have modified the tables in such way all data from P2 & P1 where copied as new rows into Table1 & Table2. So my query is simple:

SELECT ... FROM Table1 INNER JOIN Table2 ON Table1.P = Table2.P

The result was more then surprised: the execution time from many hours (the 1st case) was reduced to 2-3 seconds!

Why is it so different? Does it mean the complex conditions are always reduce performance? How can I improve the issue? May be P1,P2 indexing will help? I want to remain the 1st DB schema and not to move to one field P.

Upvotes: 1

Views: 4885

Answers (3)

Kaushal
Kaushal

Reputation: 471

you can use 4 query and Union there result

SELECT ... FROM Table1
INNER JOIN
Table2
  ON    Table1.P1 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
  ON   Table1.P1 = Table2.P2
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
  ON    Table1.P2 = Table2.P1
UNION
SELECT ... FROM Table1
INNER JOIN
Table2
  ON   Table1.P2 = Table2.P2

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269773

The reason the queries are different is because of the join strategies being used by the optimizer. There are basically four ways that two tables can be joined:

  1. "Hash join": Creates a hash table on one of the tables which it uses to look up the values in the second.
  2. "Merge join": Sorts both tables on the key and then readsthe results sequentially for the join.
  3. "Index lookup": Uses an index to look up values in one table.
  4. "Nested Loop": Compars each value in each table to all the values in the other table.

(And there are variations on these, such as using an index instead of a table, working with partitions, and handling multiple processors.) Unfortunately, in SQL Server Management Studio both (3) and (4) are shown as nested loop joins. If you look more closely, you can tell the difference from the parameters in the node.

In any case, your original join is one of the first three -- and it goes fast. These joins can basically only be used on "equi-joins". That is, when the condition joining the two tables includes an equality operator.

When you switch from a single equality to an "in" or set of "or" conditions, the join condition has changed from an equijoin to a non-equijoin. My observation is that SQL Server does a lousy job of optimization in this case (and, to be fair, I think other databases do pretty much the same thing). Your performance hit is the hit of going from a good join algorithm to the nested loops algorithm.

Without testing, I might suggest some of the following strategies.

  1. Build an index on P1 and P2 in both tables. SQL Server might use the index even for a non-equijoin.
  2. Use the union query suggested in another solution. Each query should be correctly optimized.
  3. Assuming these are 1-1 joins, you can also do this as a set of multiple joins:

    from table1 t1 left outer join table2 t2_11 on t1.p1 = t2_11.p1 left outer join table2 t2_12 on t1.p1 = t2_12.p2 left outer join table2 t2_21 on t1.p2 = t2_21.p2 left outer join table2 t2_22 on t1.p2 = t2_22.p2

And then use case/coalesce logic in the SELECT to get the value that you actually want. Although this may look more complicated, it should be quite efficient.

Upvotes: 5

whytheq
whytheq

Reputation: 35557

Does using CTEs help performance?

;WITH Table1_cte 
AS
(
SELECT 
      ...
      [P] = P1
FROM Table1
UNION   
SELECT 
      ...
      [P] = P2
FROM Table1
)
, Table2_cte 
AS
(
SELECT 
      ...
      [P] = P1
FROM Table2
UNION   
SELECT 
      ...
      [P] = P2
FROM Table2
)
SELECT ... FROM Table1_cte x
   INNER JOIN
   Table2_cte y
      ON x.P = y.P

I suspect, as far as the processor is concerned, the above is just different syntax for the same complex conditions.

Upvotes: 0

Related Questions