Mike
Mike

Reputation: 61

SQL max() function with a where clause and group by does not use the index efficiently

I have a table MYTABLE that has approximately 25 columns, with two of them being USERID (integer) and USERDATETIME (dateTime).

I have an index over this table on these two columns, with USERID being the first column followed by USERDATETIME.

I would like to get the maximum USERDATETIME for each USERID. So:

select USERID,MAX(USERDATETIME) 
from MYTABLE WHERE USERDATETIME < '2015-10-11'
GROUP BY USERID

I would have expected the optimizer to be able to find each unique USERID and maximum USERDATETIME with the number of seeks equal to the number of unique USERIDs. And I would expect this to be reasonable fast. I have 2000 userids and 6 million rows in myTable. However, the actual plan shows 6 million rows from an index scan. If I use an index with USERDATETIME/USERID, the plan changes to use an index seek, but still 6 million rows.

Why does SQL not use the index in a way that would reduce the number of rows processed?

Upvotes: 6

Views: 711

Answers (2)

Martin Smith
Martin Smith

Reputation: 453368

If you are using SQL Server this is not an optimisation generally carried out by the product (except in limited cases where the table is partitioned by that value).

However you can do it manually using the technique from here

CREATE TABLE YourTable
  (
     USERID       INT,
     USERDATETIME DATETIME,
     OtherColumns CHAR(10)
  )

CREATE CLUSTERED INDEX IX
  ON YourTable(USERID ASC, USERDATETIME ASC);

WITH R
     AS (SELECT TOP 1 USERID,
                      USERDATETIME
         FROM   YourTable
         ORDER  BY USERID DESC,
                   USERDATETIME DESC
         UNION ALL
         SELECT SubQuery.USERID,
                SubQuery.USERDATETIME
         FROM   (SELECT T.USERID,
                        T.USERDATETIME,
                        rn = ROW_NUMBER()
                               OVER (
                                 ORDER BY T.USERID DESC, T.USERDATETIME DESC)
                 FROM   R
                        JOIN YourTable T
                          ON T.USERID < R.USERID) AS SubQuery
         WHERE  SubQuery.rn = 1)
SELECT *
FROM   R

enter image description here

If you have another table with the UserIds it is possible to get an efficient plan more easily with

SELECT U.USERID,
       CA.USERDATETIME
FROM   Users U
       CROSS APPLY (SELECT TOP 1 USERDATETIME
                    FROM   YourTable Y
                    WHERE  Y.USERID = U.USERID
                    ORDER  BY USERDATETIME DESC) CA 

enter image description here

Upvotes: 2

Laughing Vergil
Laughing Vergil

Reputation: 3756

The WHERE clause is the limiting factor on your query using the index.

With a standard SQL Server query, indexes are used either to select records quickly (which that index would allow), and to limit records returned (which that index would not allow). So, why wont this index allow fir quick limitation?

When the query optimizer considers optimizations based on a WHERE clause, it looks for an index that either starts with the item(s) in the WHERE clause, or one that can be used to efficiently identify the records that are allowed (or not allowed) to be in the result set.

With this index, the server first can find the distinct userIDs involved. It then would want to limit the rows considered based on the WHERE clause. However, to do this, the optimizer will likely estimate that it will have to conduct the equivalent of a full index or table scan AFTER locating the userIDs.

An alternate strategy that might be possible is to scan the index, identifying userIDs and dates together. This is what the optimizer chose.

One possible solution to that is a different index - one by date, then userID - in addition to the one being used. This would limit the number of records being scanned to identify userID maximums, and thus be a bit faster.

Note that your index would be fast if you did not need the WHERE clause. But the where clause requires the optimizer to consider the use case where the WHERE clause limits the items selected to the last row considered.

Also, an index where the Date field was DESCENDING order might be more efficient as well.

Upvotes: 0

Related Questions