Reputation: 61
I have a table MYTABLE
that has approximately 25 columns, with two of them being USERID (integer)
and USERDATETIME (dateTime)
.
I have an index over this table on these two columns, with USERID
being the first column followed by USERDATETIME
.
I would like to get the maximum USERDATETIME for each USERID. So:
select USERID,MAX(USERDATETIME)
from MYTABLE WHERE USERDATETIME < '2015-10-11'
GROUP BY USERID
I would have expected the optimizer to be able to find each unique USERID
and maximum USERDATETIME
with the number of seeks equal to the number of unique USERID
s. And I would expect this to be reasonable fast. I have 2000 userids and 6 million rows in myTable. However, the actual plan shows 6 million rows from an index scan. If I use an index with USERDATETIME
/USERID
, the plan changes to use an index seek, but still 6 million rows.
Why does SQL not use the index in a way that would reduce the number of rows processed?
Upvotes: 6
Views: 711
Reputation: 453368
If you are using SQL Server this is not an optimisation generally carried out by the product (except in limited cases where the table is partitioned by that value).
However you can do it manually using the technique from here
CREATE TABLE YourTable
(
USERID INT,
USERDATETIME DATETIME,
OtherColumns CHAR(10)
)
CREATE CLUSTERED INDEX IX
ON YourTable(USERID ASC, USERDATETIME ASC);
WITH R
AS (SELECT TOP 1 USERID,
USERDATETIME
FROM YourTable
ORDER BY USERID DESC,
USERDATETIME DESC
UNION ALL
SELECT SubQuery.USERID,
SubQuery.USERDATETIME
FROM (SELECT T.USERID,
T.USERDATETIME,
rn = ROW_NUMBER()
OVER (
ORDER BY T.USERID DESC, T.USERDATETIME DESC)
FROM R
JOIN YourTable T
ON T.USERID < R.USERID) AS SubQuery
WHERE SubQuery.rn = 1)
SELECT *
FROM R
If you have another table with the UserIds it is possible to get an efficient plan more easily with
SELECT U.USERID,
CA.USERDATETIME
FROM Users U
CROSS APPLY (SELECT TOP 1 USERDATETIME
FROM YourTable Y
WHERE Y.USERID = U.USERID
ORDER BY USERDATETIME DESC) CA
Upvotes: 2
Reputation: 3756
The WHERE clause is the limiting factor on your query using the index.
With a standard SQL Server query, indexes are used either to select records quickly (which that index would allow), and to limit records returned (which that index would not allow). So, why wont this index allow fir quick limitation?
When the query optimizer considers optimizations based on a WHERE clause, it looks for an index that either starts with the item(s) in the WHERE clause, or one that can be used to efficiently identify the records that are allowed (or not allowed) to be in the result set.
With this index, the server first can find the distinct userIDs involved. It then would want to limit the rows considered based on the WHERE clause. However, to do this, the optimizer will likely estimate that it will have to conduct the equivalent of a full index or table scan AFTER locating the userIDs.
An alternate strategy that might be possible is to scan the index, identifying userIDs and dates together. This is what the optimizer chose.
One possible solution to that is a different index - one by date, then userID - in addition to the one being used. This would limit the number of records being scanned to identify userID maximums, and thus be a bit faster.
Note that your index would be fast if you did not need the WHERE clause. But the where clause requires the optimizer to consider the use case where the WHERE clause limits the items selected to the last row considered.
Also, an index where the Date field was DESCENDING order might be more efficient as well.
Upvotes: 0