Nick
Nick

Reputation: 679

Simple join takes far too long to run due to query plan

Some context: I have two tables, smalltable and bigtable. Smalltable contains 10,000 rows, whereas bigtable contains 2,000,000, and I am using SQL Server 2008. I started with a query as follows:

select * from [dbo].[smalltable]  t1 
INNER JOIN
[dbo].[bigtable] t2 
on (t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)

This query was running for over 15 minutes before I killed it - on inspecting the query plan, it was using a nested loop to do the inner join.

I then rewrote the query as follows:

select * from [dbo].[smalltable]  t1 
INNER JOIN
[dbo].[bigtable] t2 
on (t1.name1=t2.firstname and t1.name6=t2.lastname)
UNION
select * from [dbo].[smalltable]  t1 
INNER JOIN
[dbo].[bigtable] t2 
on (t1.name6=t2.firstname and t1.name1=t2.lastname)

The two queries above then instead executed using a Hash Match, and the whole query ran in 4 seconds.

My question is, why does SQL Server get the query plan so wrong, and was there anyway that I could have fixed the original query without rewriting it? I tried adding a hint to use a Hash Match to the first query, but it seems that you are not allowed to with multiple join criteria?

Update: Added examples of the kind of data in the tables as requested. Note, the code is looking for name matches where names may have been swapped around:

Smalltable(Columns name1,name6)

John, Smith
Johnny, Smith
Smythe, Jon
Michaels, Robert
Bob, Brown

Bigtable (Columns firstname,lastname)

John, Smith
John, Smythe
Johnny, Smith
Alison, Roberts
Robert, Michaels
Janet, Green

Upvotes: 0

Views: 3176

Answers (1)

Vojtěch Dohnal
Vojtěch Dohnal

Reputation: 8104

It is a problem within SQL Server optimizer.

The condition (t1.name1=t2.firstname and t1.name6=t2.lastname) uses clustered index seek only and thus is very fast and executes almost instantly even with very large tables.

But the condition with OR

(t1.name1=t2.firstname and t1.name6=t2.lastname) or (t1.name6=t2.firstname and t1.name1=t2.lastname)

generally performs much worse usually performing full scan. You should see execution plans for your queries.

The Query Optimizer will always perform a table scan or a clustered index scan on a table if the WHERE clause in the query contains an OR operator and if any of the referenced columns in the OR clause are not indexed (or do not have a useful index). Because of this, if you use many queries with OR clauses, you will want to ensure that each referenced column in the WHERE clause has an index.

If you have a query that uses ORs and it is not making the best use of indexes, consider rewriting it as a UNION and then testing performance. Only through testing can you be sure that one version of your query will be faster than another.

See here. So shortly your first OR query does not make good use of indexes.

I believe there is no other way than rewrite the query with UNION (as you did) or APPLY, optimizer will not do it.

Upvotes: 2

Related Questions