Performing Aggregate Functions on Multi-Million Row Tables

I'm having some serious performance issues with a multi-million row table that I feel I should be able to get results from fairly quick. Here's a run down of what I have, how I'm querying it, and how long it's taking:

I'm running SQL Server 2008 Standard, so Partitioning isn't currently an option
I'm attempting to aggregate all views for all inventory for a specific account over the last 30 days.
All views are stored in the following table:

CREATE TABLE [dbo].[LogInvSearches_Daily](
    [ID] [bigint] IDENTITY(1,1) NOT NULL,
    [Inv_ID] [int] NOT NULL,
    [Site_ID] [int] NOT NULL,
    [LogCount] [int] NOT NULL,
    [LogDay] [smalldatetime] NOT NULL,
 CONSTRAINT [PK_LogInvSearches_Daily] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

This table has 132,000,000 records, and is over 4 gigs.
A sample of 10 rows from the table:

ID                   Inv_ID      Site_ID     LogCount    LogDay
-------------------- ----------- ----------- ----------- -----------------------
1                    486752      48          14          2009-07-21 00:00:00
2                    119314      51          16          2009-07-21 00:00:00
3                    313678      48          25          2009-07-21 00:00:00
4                    298863      0           1           2009-07-21 00:00:00
5                    119996      0           2           2009-07-21 00:00:00
6                    463777      534         7           2009-07-21 00:00:00
7                    339976      503         2           2009-07-21 00:00:00
8                    333501      570         4           2009-07-21 00:00:00
9                    453955      0           12          2009-07-21 00:00:00
10                   443291      0           4           2009-07-21 00:00:00

(10 row(s) affected)

I have the following index on LogInvSearches_Daily:

/****** Object:  Index [IX_LogInvSearches_Daily_LogDay]    Script Date: 05/12/2010 11:08:22 ******/
CREATE NONCLUSTERED INDEX [IX_LogInvSearches_Daily_LogDay] ON [dbo].[LogInvSearches_Daily] 
(
    [LogDay] ASC
)
INCLUDE ( [Inv_ID],
[LogCount]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

I need to pull inventory only from the Inventory for a specific account id. I have an index on the Inventory as well.

I'm using the following query to aggregate the data and give me the top 5 records. This query is currently taking 24 seconds to return the 5 rows:

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SELECT TOP 5
    Sum(LogCount) AS Views
    , DENSE_RANK() OVER(ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank
    , Inv_ID
FROM LogInvSearches_Daily D (NOLOCK)
WHERE 
    LogDay > DateAdd(d, -30, getdate())
    AND EXISTS(
        SELECT NULL FROM propertyControlCenter.dbo.Inventory (NOLOCK) WHERE Acct_ID = 18731 AND Inv_ID = D.Inv_ID
    )
GROUP BY Inv_ID


(1 row(s) affected)

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Top(TOP EXPRESSION:((5)))
       |--Sequence Project(DEFINE:([Expr1007]=dense_rank))
            |--Segment
                 |--Segment
                      |--Sort(ORDER BY:([Expr1006] DESC, [D].[Inv_ID] DESC))
                           |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1006]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
                                |--Sort(ORDER BY:([D].[Inv_ID] ASC))
                                     |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                                          |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1011], [Expr1012], [Expr1010]))
                                          |    |--Compute Scalar(DEFINE:(([Expr1011],[Expr1012],[Expr1010])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                                          |    |    |--Constant Scan
                                          |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1011] AND [D].[LogDay] < [Expr1012]) ORDERED FORWARD)
                                          |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID]), SEEK:([propertyControlCenter].[dbo].[Inventory].[Acct_ID]=(18731) AND [propertyControlCenter].[dbo].[Inventory].[Inv_ID]=[LOA

(13 row(s) affected)

I tried using a CTE to pick up the rows first and aggregate them, but that didn't run any faster, and gives me essentially the same execution plan.


(1 row(s) affected)
StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--SET SHOWPLAN_TEXT ON;
WITH getSearches AS (
        SELECT
            LogCount
--          , DENSE_RANK() OVER(ORDER BY Sum(LogCount) DESC, Inv_ID DESC) AS Rank
            , D.Inv_ID
        FROM LogInvSearches_Daily D (NOLOCK)
            INNER JOIN propertyControlCenter.dbo.Inventory I (NOLOCK) ON Acct_ID = 18731 AND I.Inv_ID = D.Inv_ID
        WHERE 
            LogDay > DateAdd(d, -30, getdate())
--      GROUP BY Inv_ID
)

SELECT Sum(LogCount) AS Views, Inv_ID
FROM getSearches
GROUP BY Inv_ID


(1 row(s) affected)

StmtText
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1004]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
       |--Sort(ORDER BY:([D].[Inv_ID] ASC))
            |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1008], [Expr1009], [Expr1007]))
                 |    |--Compute Scalar(DEFINE:(([Expr1008],[Expr1009],[Expr1007])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                 |    |    |--Constant Scan
                 |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1008] AND [D].[LogDay] < [Expr1009]) ORDERED FORWARD)
                 |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID] AS [I]), SEEK:([I].[Acct_ID]=(18731) AND [I].[Inv_ID]=[LOALogs].[dbo].[LogInvSearches_Daily].[Inv_ID] as [D].[Inv_ID]) ORDERED FORWARD)

(8 row(s) affected)


(1 row(s) affected)

So given that I'm getting good Index Seeks in my execution plan, what can I do to get this running faster?

UPDATE:

Here's the same query run without the DENSE_RANK() and it takes the exact same 24 seconds to run, and gives me the same base query plan:

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--SET SHOWPLAN_TEXT ON
SELECT TOP 5
    Sum(LogCount) AS Views
    , Inv_ID
FROM LogInvSearches_Daily D (NOLOCK)
WHERE 
    LogDay > DateAdd(d, -30, getdate())
    AND EXISTS(
        SELECT NULL FROM propertyControlCenter.dbo.Inventory (NOLOCK) WHERE Acct_ID = 18731 AND Inv_ID = D.Inv_ID
    )
GROUP BY Inv_ID
ORDER BY Views, Inv_ID
(1 row(s) affected)

StmtText
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  |--Sort(TOP 5, ORDER BY:([Expr1006] ASC, [D].[Inv_ID] ASC))
       |--Stream Aggregate(GROUP BY:([D].[Inv_ID]) DEFINE:([Expr1006]=SUM([LOALogs].[dbo].[LogInvSearches_Daily].[LogCount] as [D].[LogCount])))
            |--Sort(ORDER BY:([D].[Inv_ID] ASC))
                 |--Nested Loops(Inner Join, OUTER REFERENCES:([D].[Inv_ID]))
                      |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1010], [Expr1011], [Expr1009]))
                      |    |--Compute Scalar(DEFINE:(([Expr1010],[Expr1011],[Expr1009])=GetRangeWithMismatchedTypes(dateadd(day,(-30),getdate()),NULL,(6))))
                      |    |    |--Constant Scan
                      |    |--Index Seek(OBJECT:([LOALogs].[dbo].[LogInvSearches_Daily].[IX_LogInvSearches_Daily_LogDay] AS [D]), SEEK:([D].[LogDay] > [Expr1010] AND [D].[LogDay] < [Expr1011]) ORDERED FORWARD)
                      |--Index Seek(OBJECT:([propertyControlCenter].[dbo].[Inventory].[IX_Inventory_Acct_ID]), SEEK:([propertyControlCenter].[dbo].[Inventory].[Acct_ID]=(18731) AND [propertyControlCenter].[dbo].[Inventory].[Inv_ID]=[LOALogs].[dbo].[LogInvS

(9 row(s) affected)

Thanks,

Dan

Upvotes: 2

Answers (3)

Daniel Renshaw

Reputation: 34177

I've not yet read through your whole question (I'll come to that shortly) but to answer an early comment: you can use partitioned views in SQL Server 2008 standard edition. It's partitioned tables (which are admittedly more flexible) that are restricted to Enterprise edition.

Paritioned views info: http://msdn.microsoft.com/en-us/library/ms190019.aspx

On the wider question I'd like to know if you really need the DENSE_RANK in there. I'm wondering if you're confused between the ORDER BY inside the DENSE_RANK and the ORDER BY of the query itself. As it stands your TOP 5 will return 5 undefined records since SQL Server does not guarantee any order on records unless an ORDER BY clause is specified (which you've not done). If you move the ORDER BY from the DENSE_RANK down to be the whole query ORDER BY as follows, the records will come out as I think you want and it will remove the need for the expensive DENSE_RANK aggregate function.

SELECT TOP 5
    SUM([LogCount]) AS [Views],
    [Inv_ID]
FROM [LogInvSearches_Daily] D (NOLOCK)
WHERE 
    [LogDay] > DateAdd(d, -30, getdate())
    AND EXISTS(
        SELECT *
        FROM Inventory (NOLOCK)
        WHERE Acct_ID = 18731
            AND Inv_ID = D.Inv_ID
    )
GROUP BY
    Inv_ID
ORDER BY
    [Views] DESC,
    [Inv_ID]

UPDATE:

The time is probably being used up here:

|--Sort(ORDER BY:([D].[Inv_ID] ASC))

You could try creating a covering index like this one:

CREATE NONCLUSTERED INDEX [IX_LogInvSearches_Daily_Perf] ON [dbo].[LogInvSearches_Daily] 
(
    [Inv_ID] ASC,
    [LogDay] ASC
)
INCLUDE
(
    [LogCount]
)

Note that I've also altered the ORDER BY slightly (Inv_ID is now sorted ASC instead of DESC). I suspect this change won't affect the results in a problematic way but may help performance since it will be returning rows in the same order that they are grouped (although ths may be irrelevant!).

Upvotes: 1

Matthew Flynn

Reputation: 2238

Acct_ID is on the Inventory table, and seems to have an index to itself (IX_Inventory_Acct_ID). Perhaps if Inventory had an index on (Acct_Id, Inv_Id) and LogInvSearches_Daily was clustered (or at least indexed) around (Inv_Id, LogDay), you'd have more luck.

BTW, I've no idea what the current clustering index on LogInvSearches_Daily.ID is supposed to be buying you. Why is it importing to have records with close IDs close on the disk?

Upvotes: 0

gbn

Reputation: 432230

Partitioning aside,

Based on our experience with larger table than yours, we extract data into a temp table (not table variable) and aggregate on that. Not for all queries, but the more complex ones.

Other than that, I agree with Daniel Renshaw's observations about DENSE_RANK

I'd also think about moving [Inv_ID], [LogCount] into the index (not include, perhaps with a DESC sort)

Upvotes: 1

Performing Aggregate Functions on Multi-Million Row Tables

Answers (3)

Related Questions