Tmdean
Tmdean

Reputation: 9299

MS Access subquery performance

I use MS Access a lot as an ad-hoc data processing tool. One thing I've noticed is that using sub-queries in certain ways tends to have very bad performance with large tables.

For example, this query performs poorly:

SELECT TOP 500 EmployeeID FROM Employee
WHERE EmployeeID NOT IN
    (SELECT EmployeeID FROM LoadHistory
    WHERE YearWeek = '2015-26');

This version of the same query performs well:

SELECT TOP 500 EmployeeID FROM Employee
WHERE NOT EXISTS
    (SELECT 1 FROM LoadHistory
    WHERE YearWeek = '2015-26' AND
        EmployeeID = Employee.EmployeeID);

And this other form of the same query also performs well:

SELECT TOP 500 Employee.EmployeeID
FROM Employee
LEFT JOIN
    (SELECT EmployeeID FROM LoadHistory
    WHERE YearWeek = '2015-26') q
ON Employee.EmployeeID = q.EmployeeID
WHERE q.EmployeeID IS NULL;

For style reasons, I prefer the first form. I can't really understand why the optimizer doesn't generate the same plan for the first and second queries. Is there any logic to how the ACE optimizer is behaving here? Are there any other ways to slightly rewrite the first query so the optimizer can do a better job?

Upvotes: 0

Views: 459

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269445

NOT IN and NOT EXISTS are very similar . . . but not quite the same.

The semantics of NOT IN specify that it never returns true if any of the values are NULL. That means that the subquery has to verify that this is true.

My guess is that this accounts for the different optimization schemes. This is also why I prefer NOT EXISTS to NOT IN. NOT EXISTS is more intuitive in the treatment of NULL values in the subquery.

Note: You should always qualify column names when you use correlated subqueries:

SELECT TOP 500 EmployeeID
FROM Employee
WHERE NOT EXISTS (SELECT 1
                  FROM LoadHistory
                  WHERE LoadHistory.YearWeek = '2015-26' AND
                        LoadHistory.EmployeeID = Employee.EmployeeID
                 );

The compiler might be smart enough to avoid this if you declare LoadHistory.EmployeeId as NOT NULL.

I should also mention that NOT EXISTS can take advantage of an index on LoadHistory(EmployeeId) or LoadHistory(EmployeeId, YearWeek). The NOT IN version can use LoadHistory(YearWeek) or LoadHistory(YearWeek, EmployeeId). Perhaps your indexes explain the difference in performance.

Upvotes: 2

Vamsi Prabhala
Vamsi Prabhala

Reputation: 49260

It is the difference between in and exists. Exists evaluates to true the first time a sub-query matches a given condition. On the other hand, in would scan the entire table.

Upvotes: 0

Related Questions