Reputation: 9299
I use MS Access a lot as an ad-hoc data processing tool. One thing I've noticed is that using sub-queries in certain ways tends to have very bad performance with large tables.
For example, this query performs poorly:
SELECT TOP 500 EmployeeID FROM Employee
WHERE EmployeeID NOT IN
(SELECT EmployeeID FROM LoadHistory
WHERE YearWeek = '2015-26');
This version of the same query performs well:
SELECT TOP 500 EmployeeID FROM Employee
WHERE NOT EXISTS
(SELECT 1 FROM LoadHistory
WHERE YearWeek = '2015-26' AND
EmployeeID = Employee.EmployeeID);
And this other form of the same query also performs well:
SELECT TOP 500 Employee.EmployeeID
FROM Employee
LEFT JOIN
(SELECT EmployeeID FROM LoadHistory
WHERE YearWeek = '2015-26') q
ON Employee.EmployeeID = q.EmployeeID
WHERE q.EmployeeID IS NULL;
For style reasons, I prefer the first form. I can't really understand why the optimizer doesn't generate the same plan for the first and second queries. Is there any logic to how the ACE optimizer is behaving here? Are there any other ways to slightly rewrite the first query so the optimizer can do a better job?
Upvotes: 0
Views: 459
Reputation: 1269445
NOT IN
and NOT EXISTS
are very similar . . . but not quite the same.
The semantics of NOT IN
specify that it never returns true if any of the values are NULL
. That means that the subquery has to verify that this is true.
My guess is that this accounts for the different optimization schemes. This is also why I prefer NOT EXISTS
to NOT IN
. NOT EXISTS
is more intuitive in the treatment of NULL
values in the subquery.
Note: You should always qualify column names when you use correlated subqueries:
SELECT TOP 500 EmployeeID
FROM Employee
WHERE NOT EXISTS (SELECT 1
FROM LoadHistory
WHERE LoadHistory.YearWeek = '2015-26' AND
LoadHistory.EmployeeID = Employee.EmployeeID
);
The compiler might be smart enough to avoid this if you declare LoadHistory.EmployeeId
as NOT NULL
.
I should also mention that NOT EXISTS
can take advantage of an index on LoadHistory(EmployeeId)
or LoadHistory(EmployeeId, YearWeek)
. The NOT IN
version can use LoadHistory(YearWeek)
or LoadHistory(YearWeek, EmployeeId)
. Perhaps your indexes explain the difference in performance.
Upvotes: 2
Reputation: 49260
It is the difference between in
and exists
. Exists
evaluates to true the first time a sub-query matches a given condition. On the other hand, in
would scan the entire table.
Upvotes: 0