PATIL DADA
PATIL DADA

Reputation: 379

How to improve performance of this SQL Server query?

I was asked this question at web developer interview. after my answer interviewer said your in second table :(

I have two tables employee and bademployee:

Now, I want to select only good employees.

My answer was :

SELECT * 
FROM employee 
WHERE empid NOT IN (SELECT badempid from bademployee)

He said this query is not good for performance.

Can any one tell me how to write query for same result, by not using negative terms(not in, !=).

Can it be done using LEFT OUTER JOIN ?

Upvotes: 1

Views: 157

Answers (3)

Dan Guzman
Dan Guzman

Reputation: 46193

I think the interviewer was wrong about the performance difference. Because the joined column is unique and not null in both tables, the NOT IN, NOT EXISTS, and LEFT JOIN...WHERE IS NULL queries are semantically identical. SQL is a declarative language so the SQL Server optimizer may provide optimal and identical plans regardless of now the query is expressed. That said, it is not always perfect so there may be variances, especially with more complex queries.

Below is a script that demonstrates this. On my SQL Server 2014 box, I see identical execution plans for the first 2 queries (ordered clustered index scans and a merge join), and the addition of a filter operator in the last. I would expect identical performance with all 3 so it doesn't really matter from a performance perspective. I would generally use NOT EXISTS because the intent is clearer and it avoids the gotcha in the case a NULL is returned by the NOT IN subquery, thus resulting in zero rows returned due to the UNKNOWN predicate result.

I would not generalize performance comparisons like this. If the joined columns allow NULL or are not guaranteed to be unique, these queries are not semantically the same and may yield different execution plans as a result.

CREATE TABLE dbo.employee (
    empid int CONSTRAINT pk_employee PRIMARY KEY
    , name varchar(20)
    );

CREATE TABLE dbo.bademployee (
      badempid int CONSTRAINT pk_bademployee PRIMARY KEY
    , name varchar(20)
    );

WITH 
    t4 AS (SELECT n FROM (VALUES(0),(0),(0),(0)) t(n))
    ,t256 AS (SELECT 0 AS n FROM t4 AS a CROSS JOIN t4 AS b CROSS JOIN t4 AS c CROSS JOIN t4 AS d)
    ,t16M AS (SELECT ROW_NUMBER() OVER (ORDER BY (a.n)) AS num FROM t256 AS a CROSS JOIN t256 AS b CROSS JOIN t256 AS c)
INSERT INTO dbo.employee(empid, name)
SELECT num, 'Employee name ' + CAST(num AS varchar(10))
FROM t16M
WHERE num <= 10000;

INSERT INTO dbo.bademployee(badempid, name)
SELECT TOP 5 PERCENT empid, name
FROM dbo.employee
ORDER BY NEWID();
GO
UPDATE STATISTICS dbo.employee WITH FULLSCAN;
UPDATE STATISTICS dbo.bademployee WITH FULLSCAN;
GO

SELECT * 
FROM employee 
WHERE empid NOT IN (SELECT badempid from bademployee);

SELECT *
FROM Employee e
WHERE NOT EXISTS (
    SELECT 1
    FROM bademployee b
    WHERE e.empid = b.badempid);

SELECT e.*
FROM Employee e
    LEFT JOIN bademployee b ON e.empid = b.badempid
WHERE b.badempid IS NULL;
GO

Upvotes: 1

TT.
TT.

Reputation: 16137

Whatever someone else may say, you need to check the execution plan and base your conclusion on what that sais. Never just trust someone else that claims this or that, research into his claims and verify that with documentation on the subject and in this case the execution plan which clearly tells you what is going on.

One example from SQL Authority blogs shows that the LEFT JOIN solution performs much worse than the NOT IN solution. This is due to a LEFT ANTI SEMI JOIN done by the query planner which generally performs a lot better than a LEFT JOIN + NULL check. There may be exceptions when there are very few rows. The author also tells you afterwards the same as I did in the first paragraph: always check the execution plan.

Another blog post from SQL Performance blogs goes into this further with actual performance testing results.

TL;DR: In terms of performance NOT EXISTS and NOT IN are on the same level but NOT EXISTS is prefered due to issues with NULL values. Also, don't just trust what anyone claims, research and verify your execution plan.

Upvotes: 1

sgeddes
sgeddes

Reputation: 62831

This can be rewritten using an OUTER JOIN with a NULL check or by using NOT EXISTS. I prefer NOT EXISTS:

SELECT *
FROM Employee e
WHERE NOT EXISTS (
    SELECT 1
    FROM bademployee b
    WHERE e.empid = b.badempid)

Here is the OUTER JOIN, but I believe you'll have better performace with NOT EXISTS.

SELECT e.*
FROM Employee e
    LEFT JOIN bademployee b ON e.empid = b.badempid
WHERE b.badempid IS NULL

Here's an interesting article about the performance differences: http://sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join

Upvotes: 3

Related Questions