Reputation: 245
I've always been under the assumption that not exists is the way to go instead of using a not in condition. However, I doing a comparison on a query I've been using, I noticed that the execution for the Not In condition actually appears to be faster. Any insight into why this could be the case, or if I've just made a horrible assumption up until this point, would be greatly appreciated!
QUERY 1:
SELECT DISTINCT
a.SFAccountID, a.SLXID, a.Name FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
AND c.Primary__C = 0
AND NOT EXISTS
(
SELECT 1 FROM [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK)
WHERE a.SFAccountID = c2.SFAccountID
AND c2.Primary__c = 1
);
QUERY 2:
SELECT
DISTINCT
a.SFAccountID FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
AND c.Primary__C = 0
AND a.SFAccountID NOT IN (SELECT SFAccountID FROM [dbo].[Salesforce_Contacts] WHERE Primary__c = 1 AND SFAccountID IS NOT NULL);
Actual Execution plan for Query 1:
Actual Execution plan for Query 2:
TIME/IO STATISTICS:
Query #1 (using not exists):
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 532 ms, elapsed time = 533 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Contacts'. Scan count 2, logical reads 3078, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'INFORMATION'. Scan count 1, logical reads 691, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ACCOUNT'. Scan count 4, logical reads 567, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Accounts'. Scan count 1, logical reads 680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 250 ms, elapsed time = 271 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Query #2 (using Not In):
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 500 ms, elapsed time = 500 ms.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Contacts'. Scan count 2, logical reads 3079, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'INFORMATION'. Scan count 1, logical reads 691, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ACCOUNT'. Scan count 4, logical reads 567, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Salesforce_Accounts'. Scan count 1, logical reads 680, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 157 ms, elapsed time = 166 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
Upvotes: 8
Views: 7263
Reputation: 5743
I think the missing index cause the difference for EXISTS()
and IN
operations.
Although the question do not ask for a better query, but for me I'll try to avoid the Distinct like this
SELECT
a.SFAccountID, a.SLXID, a.Name
FROM
[dbo].[Salesforce_Accounts] a WITH(NOLOCK)
CROSS APPLY
(
SELECT SFAccountID
FROM [dbo].[Salesforce_Contacts] WITH(NOLOCK)
WHERE SFAccountID = a.SFAccountID
GROUP BY SFAccountID
HAVING MAX(Primary__C + 0) = 0 -- Assume Primary__C is a bit value
) b
WHERE
-- Actually it is the filtering condition for account channel
EXISTS
(
SELECT * FROM _SLX_AccountChannel WITH(NOLOCK)
WHERE ACCOUNTID = a.SLXID AND STATUS IN ('Active','Customer', 'Current')
)
Upvotes: 1
Reputation: 32703
The question was: "why NOT IN
appears to be faster than NOT EXISTS
".
My answer is: it only appears to be faster, but it is the same. (in this case)
Did you actually measure the time for both queries and confirm that there is a difference?
OR you just looked at the execution plans?
As far as I understand, the query cost that you see on the screenshots (53% vs 47%) is:
It seems that in this particular case query optimizer generated almost identical plans for both queries. It is quite likely that the plans differ (slightly) in estimated number of rows for some operators in the plans, but actual performance is the same, because the plan shape is the same. If the estimated number of rows differ, it would lead to different estimated query cost that you see.
To see the difference in plans (if any), I would use a tool like SQL Sentry Plan Explorer. It shows more details and you can compare all aspects of the queries easier.
Rewriting the query to be faster is a different question and I don't attempt to answer it here.
Upvotes: 1
Reputation: 18379
You can do without hitting/joining Salesforce_Contacts
more than once. This is more compact and faster:
SELECT a.SFAccountID, a.SLXID, a.Name
FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
WHERE b.STATUS IN ('Active','Customer', 'Current')
GROUP BY a.SFAccountID, a.SLXID, a.Name
HAVING MAX(c.Primary__C) = 0
Difference between IN
and EXISTS
is neglectable.
Upvotes: 0
Reputation: 13949
This is assuming that you're trying to find Accounts that do not have a Primary Contact and there can only be one Primary Contact
SELECT a.SFAccountID, a.SLXID, a.Name
FROM [dbo].[Salesforce_Accounts] a
LEFT JOIN [dbo].[Salesforce_Contacts] c ON a.SFAccountID = c.SFAccountID AND c.Primary__C = 1
WHERE
EXISTS (SELECT *
FROM SLX_AccountChannel b
WHERE b.ACCOUNTID = a.SLXID
AND b.STATUS IN ( 'Active', 'Customer', 'Current' ))
AND c.SFContactID IS NULL
If you want accounts that have contacts but no primary contact you can use
SELECT
a.SFAccountID ,
a.SLXID ,
a.Name
FROM
[dbo].[Salesforce_Accounts] a
WHERE
a.SFAccountID IN (SELECT SFAccountID
FROM [Salesforce_Contacts]
GROUP BY SFAccountID
HAVING SUM(CAST(Primary__c AS INT) = 0))
AND a.SLXID IN (SELECT ACCOUNTID
FROM _SLX_AccountChannel
WHERE [STATUS] IN ( 'Active', 'Customer', 'Current' ))
Upvotes: 0
Reputation: 525
as far as i understand it, a not in works in the same fashion as two nested for instructions would.
so, asuming you have two tables: table(1000 records) and tabla (2000 records),
select * from table where table.field not in (select field from tabla)
is like doing
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 2000; j++) {
}
}
that is 1000*2000 = 2 million operations.
the left join with tabla.field is null trick, again, as far as i understand it, makes only 2000 operations
Use left join.
Upvotes: 1
Reputation: 45106
try
SELECT DISTINCT a.SFAccountID, a.SLXID, a.Name
FROM [dbo].[Salesforce_Accounts] a WITH(NOLOCK)
JOIN _SLX_AccountChannel b WITH(NOLOCK)
ON a.SLXID = b.ACCOUNTID
AND b.STATUS IN ('Active','Customer', 'Current')
JOIN [dbo].[Salesforce_Contacts] c WITH(NOLOCK)
ON a.SFAccountID = c.SFAccountID
AND c.Primary__C = 0
LEFT JOIN [dbo].[Salesforce_Contacts] c2 WITH(NOLOCK)
on c2.SFAccountID = a.SFAccountID
AND c2.Primary__c = 1
WHERE c2.SFAccountID is null
Upvotes: 2