Reputation: 3912

Performance of "NOT IN" in SQL query

I'm quite new to SQL query analysis. Recently I stumbled upon a performance issue with one of the queries and I'm wondering whether my thought process is correct here and why Query Optimizer works the way it works in this case. I'm om SQL Server 2012. I've got a SQL query that looks like

SELECT * FROM T1 
WHERE Id NOT IN 
(SELECT DISTINCT T1_Id from T2);

It takes around 30 seconds to run on my test server. While trying to understand what is taking so long I rewrote it using a temp table, like this:

SELECT DISTINCT T1_Id 
INTO #temp from T2;
SELECT * FROM T1 
WHERE Id NOT IN  
(SELECT T1_Id from #temp);

It runs a hundred times faster than the first one.

Some info about the tables: T2 has around 1 million rows, and there are around 1000 distinct values of T1_id there. T1 has around 1000+ rows. Initially I only had a clustered index on T2 on a column other than T1_Id, so T1_id wasn't indexed at all.

Looking at the execution plans, I saw that for the first query there were as many index scans as there are distinct T1_id values, so basically SQL Server performs about 1000 index scans in this case. That made me realize that adding a non-clustered index on T1_id may be a good idea (the index should've been there from the start, admittedly), and adding an index indeed made the original query run much faster since now it does nonclustered index seeks.

What I'm looking for is to understand the Query optimizer behavior for the original query - does it look reasonable? Are there any ways to make it work in a way similar to the temporary table variant that I posted here rather than doing multiple scans? Am I just misunderstanding something here? Thanks in advance for any links to the similar discussion as I haven't really found anything useful.

Upvotes: 4

Answers (4)

Branko Dimitrijevic

Reputation: 52107

This is just a guess, but hopefully an educated one...

The DBMS probably concluded that searching a large table small number of times is faster than searching a small table large number of times. That's why you had ~1000 searches on T2, instead of ~1000000 searches on T1.

When you added an index on T2.T1_Id, that turned ~1000 table scans (or full clustered index scans if the table is clustered) into ~1000 index seeks, which made things much faster, as you already noted.

I'm not sure why it didn't attempt a hash join (or a merge join after the index was added) - perhaps it had stale statistics and badly overestimated the number of distinct values?

One more thing: is there a FOREIGN KEY on T2.T1_Id referencing T1.Id? I know Oracle can use FKs to improve the accuracy of cost estimates (in this case, it could infer that the cardinality of T2.T1_Id cannot be greater than T1.Id). If MS SQL Server does something similar, and the FK is missing (or is untrusted), that could contribute to the MS SQL Server thinking there are more distinct values than there really are.

(BTW, it would have helped if you posted the actual query plans and the database structure.)

Upvotes: 0

Gordon Linoff

Reputation: 1269445

The SQL Server optimizer needs to understand the size if tables for some of its decisions.

When doing a NOT IN with a subquery, those estimates may not be entirely accurate. When the table is actually materialized, the count would be highly accurate.

I think the first would be faster with an index on

Table2(t1_id)

Upvotes: 1

Bohemian

Reputation: 424953

The actual performance will likely vary from the estimates, but neither of your queries will out-perform this query, which is the de facto standard approach:

SELECT T1.* FROM T1 
LEFT JOIN T2 ON T1.Id = T2.T1_Id
WHERE T2.T1_Id IS NULL

This uses a proper join, which will perform very well (assuming the foreign key column is indexed) and being an left (outer) join the WHERE condition selects only those rows from T1 that don't join (all columns of the right side table are null when the join misses).

Note also that DISTINCT is not required, since there is only ever one row returned from T1 for missed joins.

Upvotes: 2

Dan Bracuk

Reputation: 20794

Not in is intuitive but slow. This construct will generally run quicker.

where id in 
(select id from t1
except select t1_id from t2)

Upvotes: 2

Performance of &quot;NOT IN&quot; in SQL query

Answers (4)

Related Questions

Performance of "NOT IN" in SQL query