SQL SERVER generates different and unoptimized execution plan for two similar queries

Question

The following two queries are performed with SQL SERVER

SELECT TOP(10) [c].[Name] AS [I0], [c].[Surname] AS [I1],(
    SELECT MAX([r].[Date])
    FROM [Presences].[Regs] AS [r]
    WHERE ([c].[Id] = [r].[ColId]) AND [r].[ColId] IS NOT NULL) AS [I7], [c].[Id] AS [I10]
FROM [Presences].[Cols] AS [c]

SELECT TOP(10) [c].[Code] AS [I0], [c].[Description] AS [I1], (
    SELECT MAX([r].[Date])
    FROM [Presences].[Regs] AS [r]
    WHERE ([c].[Id] = [r].[CantId]) AND [r].[CantId] IS NOT NULL) AS [I9], [c].[Id] AS [I10]
FROM [Presences].[Cants] AS [c]

The second query is evaluated in less than 1 second but the first one takes more than 20 seconds.

Infact the generated execution plan for the first one is very different:

Bad execution plan of the first query

if compared with the second one:

Good execution plan of the second query

It is unclear to me why an index seek is not choosen.

Those are the indexes declared on the table [Regs]:

CREATE NONCLUSTERED INDEX [IX_Regs_Date] ON [Presences].[Regs]
(
    [Date] ASC
)

CREATE NONCLUSTERED INDEX [IX_Regs_ColId] ON [Presences].[Regs]
(
    [ColId] ASC
)

CREATE NONCLUSTERED INDEX [IX_Regs_CantId] ON [Presences].[Regs]
(
    [CantId] ASC
)

Table [Cols] has around 600 rows and table [Cants] 21000.

The interesting fact is that the following query (using FORCESEEK) generates the correct execution plan:

SELECT TOP(10) [c].[Name] AS [I0], [c].[Surname] AS [I1],(
    SELECT MAX([r].[Date])
    FROM [Presences].[Regs] AS [r]
    WITH (FORCESEEK)  
    WHERE ([c].[Id] = [r].[ColId]) AND [r].[ColId] IS NOT NULL) AS [I8]
FROM [Presences].[Cols] AS [c]
ORDER BY [c].[Name], [c].[Surname]

but I can't specify this hint because the queries are generated using an ORM.

If you need further informations I will be glad to provide them.

Martin Smith · Accepted Answer

Your problem query is

SELECT TOP(10) c.Name                      AS I0,
               c.Surname                   AS I1,
               (SELECT MAX(r.Date)
                FROM   Presences.Regs AS r
                WHERE  ( c.Id = r.ColId )) AS I7,
               c.Id                        AS I10
FROM   Presences.Cols AS c 
ORDER BY c.Name, c.Surname

In both the good and the bad plan the top is the same

It scans the Cols table, sorts by Name, Surname and then proceeds to calculate the MAX(Date) in Presences.Regs where r.ColId matches the corresponding c.Id from the outer row.

The ideal index for this would be one on ColId, Date - so it can seek into the index and just read the last one for that ColId.

You don't have this index though so it has two choices

Scan IX_Regs_Date - this is in date order so it can stop the scan as soon as it finds the first row matching c.Id = r.ColId
Seek into IX_Regs_ColId. Get all rows matching the predicate and then aggregate them to find the MAX.

For option 1 it estimates that it will only have to read 283 rows per scan on average before finding the first one matching c.Id = r.ColId. In reality it reads 1,358,719 per execution (and there are 10 executions so this is a corresponding 13.58 million key lookups). There are 1,517,230 rows in the table so it looks as though for some reason all the ones matching the join are congregated towards later dates in the table.

The easiest way of fixing this would be to give it the ideal index of ColId, Date - this covers the query so removes the lookup that is present even in the "good" plan and is an obvious best choice so will remove the temptation for SQL Server to select the "bad" plan.

--Suggested replacement for IX_Regs_ColId
CREATE NONCLUSTERED INDEX [IX_Regs_ColId_Date] ON [Presences].[Regs]
(
    [ColId] ASC,
    [Date] ASC
)

SQL SERVER generates different and unoptimized execution plan for two similar queries

Answers (1)

Related Questions