Reputation: 80
I have a database with courses. Each course contains a set of nodes, and some nodes contains a set of answers from students. The Answer table looks (simplified) like this:
Answer
id | courseId | nodeId | answer
------------------------------------------------
1 | 1 | 1 | <- text ->
2 | 2 | 2 | <- text ->
3 | 1 | 1 | <- text ->
4 | 1 | 3 | <- text ->
5 | 2 | 2 | <- text ->
.. | .. | .. | ..
When a teacher opens a course (i.e. courseId = 1) I want to pick the node that have received the most answers lately. I can do this using the following query:
with Answers as
(
select top 50 id, nodeId from Answer A where courseId=1 order by id desc
)
select top 1 nodeId from Answers group by nodeId order by count(id) desc
or equally using this query:
select top 1 nodeId from
(select top 50 id, nodeId from Answer A where courseId=1 order by id desc)
group by nodeId order by count(id) desc
In both querys the newest 50 answers (with the highest ids) are selected and then grouped by nodeId so I can pick the one with the highest frequency. My problem is, however, that the query is very slow. If I only run the subselect, it takes less than a second, and grouping 50 rows should be fast, but when I run the entire query it takes about 10 seconds! My guess is that sql server does the select and grouping first, and afterwards does the top 50 and top 1, which in this case leads to terrible performance.
So, how can I rewrite the query to be efficient?
Upvotes: 1
Views: 42
Reputation: 86745
To be more insightful we'd need to see the indexes on that table and the execution plans you're getting (one plan for the inner query on it's own, one plan for the full query).
I'd even recommend doing the same analysis again having added the index mentioned elsewhere on this page.
Without that information the only things we can recommend are trial and error.
For example, try avoiding using TOP
(this shouldn't matter, but we're guessing while we can't see your indexes and execution plans)
WITH
Answers AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY id DESC) AS rowId,
id,
nodeId
FROM
Answer
WHERE
courseId = 1
),
top50 AS
(
SELECT
nodeId,
COUNT(*) AS row_count
FROM
Answers
WHERE
rowId <= 50
GROUP BY
nodeId
),
ranked AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY row_count DESC, nodeId DESC) AS ordinal,
nodeID
FROM
top50
)
SELECT
nodeID
FROM
ranked
WHERE
oridinal = 1
Which is massively over the top, but functionally the same as you have in your OP, but sufficiently different to potentially get a different execution plan.
Alternatively (and not very nice), just put the results of your inner query in to a table variable, then run the outer query on the table variable.
I still expect, however, that adding the index will be the least-worst option.
Upvotes: 0
Reputation: 1270091
You can add indexes to make your queries more efficient. For this query:
with Answers as (
select top 50 id, nodeId
from Answer A
where courseId = 1
order by id desc
)
select top 1 nodeId
from Answers
group by nodeId
order by count(id) desc;
The best index is Answer(courseId, id, nodeid)
.
Upvotes: 2