tchock
tchock

Reputation: 245

Not Exists vs Not In: efficiency

I've always been under the assumption that not exists is the way to go instead of using a not in condition. However, I doing a comparison on a query I've been using, I noticed that the execution for the Not In condition actually appears to be faster. Any insight into why this could be the case, or if I've just made a horrible assumption up until this point, would be greatly appreciated!

QUERY 1:

SELECT DISTINCT 
a.SFAccountID, a.SLXID, a.Name FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN  _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
AND c.Primary__C = 0
AND NOT EXISTS
(
SELECT 1 FROM [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK)
WHERE a.SFAccountID = c2.SFAccountID
AND c2.Primary__c = 1
);

QUERY 2:

SELECT   
DISTINCT
a.SFAccountID FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN  _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK) 
ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
AND c.Primary__C = 0
AND a.SFAccountID NOT IN (SELECT SFAccountID FROM [dbo].[Salesforce_Contacts] WHERE Primary__c = 1 AND SFAccountID IS NOT NULL);

Actual Execution plan for Query 1: Execution plan 1

Actual Execution plan for Query 2:Execution plan 2

TIME/IO STATISTICS:

Query #1 (using not exists):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 532 ms, elapsed time = 533 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Contacts'. Scan count 2, logical reads 3078, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'INFORMATION'. Scan count 1, logical reads 691, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ACCOUNT'. Scan count 4, logical reads 567, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Accounts'. Scan count 1, logical reads 680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 250 ms,  elapsed time = 271 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Query #2 (using Not In):

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 500 ms, elapsed time = 500 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Contacts'. Scan count 2, logical reads 3079, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'INFORMATION'. Scan count 1, logical reads 691, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ACCOUNT'. Scan count 4, logical reads 567, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Accounts'. Scan count 1, logical reads 680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 157 ms,  elapsed time = 166 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Upvotes: 8

Views: 7263

Answers (6)

Eric
Eric

Reputation: 5743

I think the missing index cause the difference for EXISTS() and IN operations.

Although the question do not ask for a better query, but for me I'll try to avoid the Distinct like this

SELECT
    a.SFAccountID, a.SLXID, a.Name 
FROM 
    [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
    CROSS APPLY 
    (
        SELECT SFAccountID 
        FROM [dbo].[Salesforce_Contacts] WITH(NOLOCK) 
        WHERE SFAccountID  = a.SFAccountID 
        GROUP BY SFAccountID
        HAVING MAX(Primary__C + 0) = 0 -- Assume Primary__C is a bit value
    ) b
WHERE
    -- Actually it is the filtering condition for account channel
    EXISTS
    (
        SELECT * FROM _SLX_AccountChannel WITH(NOLOCK) 
        WHERE ACCOUNTID = a.SLXID AND STATUS IN ('Active','Customer', 'Current')
    )

Upvotes: 1

Vladimir Baranov
Vladimir Baranov

Reputation: 32703

The question was: "why NOT IN appears to be faster than NOT EXISTS".

My answer is: it only appears to be faster, but it is the same. (in this case)

Did you actually measure the time for both queries and confirm that there is a difference?

OR you just looked at the execution plans?

As far as I understand, the query cost that you see on the screenshots (53% vs 47%) is:

  • estimated query cost, even though the plans are actual;
  • it is the query cost, not time, which is combined from CPU and IO "costs".

It seems that in this particular case query optimizer generated almost identical plans for both queries. It is quite likely that the plans differ (slightly) in estimated number of rows for some operators in the plans, but actual performance is the same, because the plan shape is the same. If the estimated number of rows differ, it would lead to different estimated query cost that you see.

To see the difference in plans (if any), I would use a tool like SQL Sentry Plan Explorer. It shows more details and you can compare all aspects of the queries easier.


Rewriting the query to be faster is a different question and I don't attempt to answer it here.

Upvotes: 1

Diego
Diego

Reputation: 18379

You can do without hitting/joining Salesforce_Contacts more than once. This is more compact and faster:

SELECT a.SFAccountID, a.SLXID, a.Name
FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN  _SLX_AccountChannel b WITH(NOLOCK)
    ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
    ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
GROUP BY a.SFAccountID, a.SLXID, a.Name
HAVING MAX(c.Primary__C) = 0

Difference between IN and EXISTS is neglectable.

Upvotes: 0

JamieD77
JamieD77

Reputation: 13949

This is assuming that you're trying to find Accounts that do not have a Primary Contact and there can only be one Primary Contact

SELECT  a.SFAccountID, a.SLXID, a.Name 
FROM    [dbo].[Salesforce_Accounts] a
        LEFT JOIN [dbo].[Salesforce_Contacts] c ON a.SFAccountID = c.SFAccountID AND c.Primary__C = 1
WHERE
        EXISTS (SELECT  * 
                FROM SLX_AccountChannel b 
                WHERE b.ACCOUNTID = a.SLXID 
                    AND b.STATUS IN ( 'Active', 'Customer', 'Current' ))
        AND c.SFContactID IS NULL

If you want accounts that have contacts but no primary contact you can use

SELECT 
    a.SFAccountID ,
    a.SLXID ,
    a.Name
FROM 
    [dbo].[Salesforce_Accounts] a
WHERE
    a.SFAccountID IN (SELECT SFAccountID 
                    FROM [Salesforce_Contacts] 
                    GROUP BY SFAccountID 
                    HAVING SUM(CAST(Primary__c AS INT) = 0))

    AND a.SLXID IN (SELECT ACCOUNTID 
                    FROM _SLX_AccountChannel 
                    WHERE [STATUS] IN ( 'Active', 'Customer', 'Current' ))

Upvotes: 0

Prefijo Sustantivo
Prefijo Sustantivo

Reputation: 525

as far as i understand it, a not in works in the same fashion as two nested for instructions would.

so, asuming you have two tables: table(1000 records) and tabla (2000 records),

select * from table where table.field not in (select field from tabla)

is like doing

for (int i = 0;  i < 1000; i++) {
   for (int j = 0;  j < 2000; j++) {
   }
}

that is 1000*2000 = 2 million operations.

the left join with tabla.field is null trick, again, as far as i understand it, makes only 2000 operations

Use left join.

Upvotes: 1

paparazzo
paparazzo

Reputation: 45106

try

SELECT DISTINCT a.SFAccountID, a.SLXID, a.Name 
  FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
  JOIN _SLX_AccountChannel b WITH(NOLOCK)
    ON a.SLXID = b.ACCOUNTID
   AND b.STATUS IN ('Active','Customer', 'Current')
  JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
    ON a.SFAccountID = c.SFAccountID 
   AND c.Primary__C = 0
  LEFT JOIN [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK) 
    on c2.SFAccountID = a.SFAccountID
   AND c2.Primary__c = 1
 WHERE c2.SFAccountID is null 

Upvotes: 2

Related Questions