Dai
Dai

Reputation: 155065

SQL Server - Nested query takes 40 minutes to run

I have a very large web forum application (about 20 million posts since 2001) running from a SQL Server 2012 database. The data files are about 40GB in size.

I added indexes to the tables for appropriate fields, however this query (which reveals the date range of posts in each forum) takes about 40 minutes to run:

SELECT
    T2.ForumId,
    Forums.Title,
    T2.ForumThreads,
    T2.ForumPosts,
    T2.ForumStart,
    T2.ForumStop

FROM
    Forums
    INNER JOIN (

    SELECT
        Min(ThreadStart) As ForumStart,
        Max(ThreadStop) As ForumStop,
        Count(*) As ForumThreads,
        Sum(ThreadPosts) As ForumPosts,
        Threads.ForumId
    FROM
        Threads
        INNER JOIN (

            SELECT
                Min(Posts.DateTime) As ThreadStart,
                Max(Posts.DateTime) As ThreadStop,
                Count(*) As ThreadPosts,
                Posts.ThreadId
            FROM
                Posts
            GROUP BY
                Posts.ThreadId

        ) As P2 ON Threads.ThreadId = P2.ThreadId

    GROUP BY
        Threads.ForumId

) AS T2 ON T2.ForumId = Forums.ForumId

How could I speed it up?

UPDATE:

This is the Estimated Execution Plan, from right-to-left:

[Path 1]

Clustered Index Scan (Clustered) [Posts].[PK_Posts], Cost: 98%
Hash Match (Partial Aggregate), Cost: 2%
Parallelism (Repartition Streams), Cost: 0%
Hash Match (Aggregate), Cost 0%
Compute Scalar, Cost: 0%
Bitmap (Bitmap Create), Cost: 0%

[Path 2]

Index Scan (NonClustered) [Threads].[IX_ForumId], Cost: 0%
Parallelism (Repartition Streams), Cost: 0%

[Path 1 and 2 converge into Path 3]

Hash Match (Inner Join), Cost: 0%
Hash Match (Partial Agregate), Cost: 0%
Parallelism (Repartition Streams), Cost: 0%
Sort, Cost: 0%
Stream Aggregate (Aggregate), Cost: 0%
Compute Scalar, Cost: 0%

[Path 4]
Clustered Index Seek (Clustered) [Forums].[PK_Forums], Cost: 0%

[Path 3 and 4 converge into Path 5]

Nested Loops (Inner Join), Cost: 0%
Paralleism (Gather Streams), Cost: 0%
SELECT, Cost: 0%

Upvotes: 0

Views: 556

Answers (6)

Dai
Dai

Reputation: 155065

I added some more indexes to the database and it sped things up considerably. Execution time is now about 20 seconds (!!). I'll admit that a lot of the added indexes were guesswork (or just adding them randomly).

Upvotes: 0

Chris Shaffer
Chris Shaffer

Reputation: 32575

If you denormalize by adding ForumId to the Posts table, you would be able to query all of the stats straight out of the Posts table. With the right index, this would probably perform pretty well. Of course, that will require a small change to your code to include the ForumId when inserting into the Posts table...

Upvotes: 0

Paul Williams
Paul Williams

Reputation: 17020

Indexes may work when you do SELECT FROM, but the results of the sub-queries are not indexed. The join to them is probably killing the performance.

As buckley suggested, I'd try storing the intermediate results in a #temp table and adding an index before doing the final query.

But the outer SELECT doesn't include thread-specific information. It looks like the query is just selecting the min/max date by forum. If so, you can just get the min/max/count posts grouped by forum.

Upvotes: 1

Gordon Bell
Gordon Bell

Reputation: 13633

How about something like this? Anyway, you get the idea...

SELECT f.ForumID,
f.Title,
MIN(p.[DateTime]) as ForumStart,
MAX(p.[DateTime]) as ForumStop,
COUNT(DISTINCT f.ForumID) as ForumPosts,
COUNT(DISTINCT t.ThreadID) as ForumThreads
FROM Forums f
INNER JOIN Threads t
ON f.ForumID = t.ForumID
INNER JOIN Posts p
ON p.ThreadID = p.ThreadID
GROUP BY f.ForumID, f.Title

Upvotes: 1

HLGEM
HLGEM

Reputation: 96552

Do you really need to aggregate twice? Would this query give you the same results?

SELECT 
T2.ForumId, 
Forums.Title, 
T2.ForumThreads, 
T2.ForumPosts, 
T2.ForumStart, 
T2.ForumStop  
FROM 
    Forums 
INNER JOIN (  
    SELECT
         Min(ThreadStart) As ForumStart,
         Max(ThreadStop) As ForumStop,     
         Count(*) As ForumThreads,     
         Sum(ThreadPosts) As ForumPosts,     
         Threads.ForumId 
    FROM     
        Threads     
    INNER JOIN (          
                SELECT             
                    Posts.DateTime As ThreadStart,             
                    Posts.DateTime As ThreadStop,             
                    Count(*) As ThreadPosts,             
                    Posts.ThreadId         
                FROM             
                    Posts         
                 ) As P2 ON Threads.ThreadId = P2.ThreadId  
    GROUP BY     
        Threads.ForumId  
    ) AS T2 ON T2.ForumId = Forums.ForumId 

Upvotes: 0

buckley
buckley

Reputation: 14069

Have you tried putting those 2 derived table in a #temp table? SQL Server will get statistics (single column) from them and you can lay indexes on them as well.

Also, at first sight indexed views could help here as you have lots of aggregates.

Upvotes: 1

Related Questions