Bjørn Ove Thue
Bjørn Ove Thue

Reputation: 80

Performance issue on selecting n newest rows in subselect

I have a database with courses. Each course contains a set of nodes, and some nodes contains a set of answers from students. The Answer table looks (simplified) like this:

Answer

id  | courseId |  nodeId |  answer
------------------------------------------------
 1  |   1      |   1     |  <- text ->
 2  |   2      |   2     |  <- text ->
 3  |   1      |   1     |  <- text ->
 4  |   1      |   3     |  <- text ->
 5  |   2      |   2     |  <- text ->
..  |  ..      |   ..    |  ..

When a teacher opens a course (i.e. courseId = 1) I want to pick the node that have received the most answers lately. I can do this using the following query:

with Answers as
(
   select top 50 id, nodeId from Answer A where courseId=1 order by id desc
)
select top 1 nodeId from Answers group by nodeId order by count(id) desc

or equally using this query:

select top 1 nodeId from 
    (select top 50 id, nodeId from Answer A where courseId=1 order by id desc)
    group by nodeId order by count(id) desc

In both querys the newest 50 answers (with the highest ids) are selected and then grouped by nodeId so I can pick the one with the highest frequency. My problem is, however, that the query is very slow. If I only run the subselect, it takes less than a second, and grouping 50 rows should be fast, but when I run the entire query it takes about 10 seconds! My guess is that sql server does the select and grouping first, and afterwards does the top 50 and top 1, which in this case leads to terrible performance.

So, how can I rewrite the query to be efficient?

Upvotes: 1

Views: 42

Answers (2)

MatBailie
MatBailie

Reputation: 86745

To be more insightful we'd need to see the indexes on that table and the execution plans you're getting (one plan for the inner query on it's own, one plan for the full query).

I'd even recommend doing the same analysis again having added the index mentioned elsewhere on this page.

Without that information the only things we can recommend are trial and error.

For example, try avoiding using TOP (this shouldn't matter, but we're guessing while we can't see your indexes and execution plans)

WITH
    Answers AS
(
    SELECT
        ROW_NUMBER() OVER (ORDER BY id DESC)   AS rowId,
        id,
        nodeId
    FROM
        Answer
    WHERE
        courseId = 1
),
    top50 AS
(
    SELECT
        nodeId,
        COUNT(*)   AS row_count
    FROM
        Answers
    WHERE
        rowId <= 50
    GROUP BY
        nodeId
),
    ranked AS
(
    SELECT
        ROW_NUMBER() OVER (ORDER BY row_count DESC, nodeId DESC)  AS ordinal,
        nodeID
    FROM
        top50
)
SELECT
    nodeID
FROM
    ranked
WHERE
    oridinal = 1

Which is massively over the top, but functionally the same as you have in your OP, but sufficiently different to potentially get a different execution plan.

Alternatively (and not very nice), just put the results of your inner query in to a table variable, then run the outer query on the table variable.

I still expect, however, that adding the index will be the least-worst option.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270091

You can add indexes to make your queries more efficient. For this query:

with Answers as (
      select top 50 id, nodeId
      from Answer A
      where courseId = 1
      order by id desc
     )
select top 1 nodeId
from Answers
group by nodeId
order by count(id) desc;

The best index is Answer(courseId, id, nodeid).

Upvotes: 2

Related Questions