Reputation: 155065
I have a very large web forum application (about 20 million posts since 2001) running from a SQL Server 2012 database. The data files are about 40GB in size.
I added indexes to the tables for appropriate fields, however this query (which reveals the date range of posts in each forum) takes about 40 minutes to run:
SELECT
T2.ForumId,
Forums.Title,
T2.ForumThreads,
T2.ForumPosts,
T2.ForumStart,
T2.ForumStop
FROM
Forums
INNER JOIN (
SELECT
Min(ThreadStart) As ForumStart,
Max(ThreadStop) As ForumStop,
Count(*) As ForumThreads,
Sum(ThreadPosts) As ForumPosts,
Threads.ForumId
FROM
Threads
INNER JOIN (
SELECT
Min(Posts.DateTime) As ThreadStart,
Max(Posts.DateTime) As ThreadStop,
Count(*) As ThreadPosts,
Posts.ThreadId
FROM
Posts
GROUP BY
Posts.ThreadId
) As P2 ON Threads.ThreadId = P2.ThreadId
GROUP BY
Threads.ForumId
) AS T2 ON T2.ForumId = Forums.ForumId
How could I speed it up?
UPDATE:
This is the Estimated Execution Plan, from right-to-left:
[Path 1]
Clustered Index Scan (Clustered) [Posts].[PK_Posts], Cost: 98%
Hash Match (Partial Aggregate), Cost: 2%
Parallelism (Repartition Streams), Cost: 0%
Hash Match (Aggregate), Cost 0%
Compute Scalar, Cost: 0%
Bitmap (Bitmap Create), Cost: 0%
[Path 2]
Index Scan (NonClustered) [Threads].[IX_ForumId], Cost: 0%
Parallelism (Repartition Streams), Cost: 0%
[Path 1 and 2 converge into Path 3]
Hash Match (Inner Join), Cost: 0%
Hash Match (Partial Agregate), Cost: 0%
Parallelism (Repartition Streams), Cost: 0%
Sort, Cost: 0%
Stream Aggregate (Aggregate), Cost: 0%
Compute Scalar, Cost: 0%
[Path 4]
Clustered Index Seek (Clustered) [Forums].[PK_Forums], Cost: 0%
[Path 3 and 4 converge into Path 5]
Nested Loops (Inner Join), Cost: 0%
Paralleism (Gather Streams), Cost: 0%
SELECT, Cost: 0%
Upvotes: 0
Views: 556
Reputation: 155065
I added some more indexes to the database and it sped things up considerably. Execution time is now about 20 seconds (!!). I'll admit that a lot of the added indexes were guesswork (or just adding them randomly).
Upvotes: 0
Reputation: 32575
If you denormalize by adding ForumId to the Posts table, you would be able to query all of the stats straight out of the Posts table. With the right index, this would probably perform pretty well. Of course, that will require a small change to your code to include the ForumId when inserting into the Posts table...
Upvotes: 0
Reputation: 17020
Indexes may work when you do SELECT FROM
, but the results of the sub-queries are not indexed. The join to them is probably killing the performance.
As buckley suggested, I'd try storing the intermediate results in a #temp table and adding an index before doing the final query.
But the outer SELECT
doesn't include thread-specific information. It looks like the query is just selecting the min/max date by forum. If so, you can just get the min/max/count posts grouped by forum.
Upvotes: 1
Reputation: 13633
How about something like this? Anyway, you get the idea...
SELECT f.ForumID,
f.Title,
MIN(p.[DateTime]) as ForumStart,
MAX(p.[DateTime]) as ForumStop,
COUNT(DISTINCT f.ForumID) as ForumPosts,
COUNT(DISTINCT t.ThreadID) as ForumThreads
FROM Forums f
INNER JOIN Threads t
ON f.ForumID = t.ForumID
INNER JOIN Posts p
ON p.ThreadID = p.ThreadID
GROUP BY f.ForumID, f.Title
Upvotes: 1
Reputation: 96552
Do you really need to aggregate twice? Would this query give you the same results?
SELECT
T2.ForumId,
Forums.Title,
T2.ForumThreads,
T2.ForumPosts,
T2.ForumStart,
T2.ForumStop
FROM
Forums
INNER JOIN (
SELECT
Min(ThreadStart) As ForumStart,
Max(ThreadStop) As ForumStop,
Count(*) As ForumThreads,
Sum(ThreadPosts) As ForumPosts,
Threads.ForumId
FROM
Threads
INNER JOIN (
SELECT
Posts.DateTime As ThreadStart,
Posts.DateTime As ThreadStop,
Count(*) As ThreadPosts,
Posts.ThreadId
FROM
Posts
) As P2 ON Threads.ThreadId = P2.ThreadId
GROUP BY
Threads.ForumId
) AS T2 ON T2.ForumId = Forums.ForumId
Upvotes: 0
Reputation: 14069
Have you tried putting those 2 derived table in a #temp table? SQL Server will get statistics (single column) from them and you can lay indexes on them as well.
Also, at first sight indexed views could help here as you have lots of aggregates.
Upvotes: 1