Nikita Rybak
Nikita Rybak

Reputation: 68046

Return top N rows per group in MySQL, but efficiently

I have a pretty simple table in MySQL 5.7.30, which I boiled down to the three columns below. I'm trying to determine top N elements per group for some groups (WHERE groupable IN (3, 4, 5)). But I cannot do it efficiently even for a single group (see WHERE groupable = 3 below).

DROP TABLE IF EXISTS test;
CREATE TABLE test (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    groupable BIGINT NOT NULL,
    orderable BIGINT NOT NULL,
    UNIQUE INDEX test_index_1 (groupable, orderable),
    UNIQUE INDEX test_index_2 (orderable, groupable),
    INDEX test_index_3 (orderable),
    INDEX test_index_4 (groupable)
);
INSERT INTO test(groupable, orderable) VALUES
    (1, 100), (1, 101), (1, 102), (1, 103), (1, 104), (1, 105), (1, 106), (1, 107),
    (2, 200), (2, 201), (2, 202), (2, 203), (2, 204), (2, 205), (2, 206), (2, 207),
    (3, 300), (3, 301), (3, 302), (3, 303), (3, 304), (3, 305), (3, 306), (3, 307),
    (4, 400);


EXPLAIN SELECT id FROM test
WHERE groupable = 3
ORDER BY orderable LIMIT 2;

The final EXPLAIN returns the rows value of 8. According to the documentation, "the rows column indicates the number of rows MySQL believes it must examine to execute the query." I was hoping that having a (groupable, orderable) index would alleviate the need to examine every row with groupable = 3 and allow the engine to access the largest ones directly. Is that not the case? Is there a way around that?

I see people ask this question all the time, but all the answers I've seen so far seem to have the same downside: examining every row per group. Or for those that don't have a WHERE/IN clause, examining the whole table.

Thanks for your help!

Note: while this example is small, I've reproduced the same on a table with thousands of groupables and hundreds of rows for each groupable.

Note #2: I've added extra indexes just in case, to make sure I'm not missing some hidden optimisation.

Upvotes: 0

Views: 625

Answers (2)

Chris Mederos
Chris Mederos

Reputation: 26

The composite index that includes the grouping and ordering column will fully cover this query. Additionally, mysql will stop reading the index as soon as it finds the number of results specified in the LIMIT.

In this way, the query will not examine all the rows when it actually runs. The EXPLAIN clause is an approximation and does not include this short-circuit LIMIT optimization in its estimation for ROWS examined.

From the docs... https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html

MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast

https://dev.mysql.com/doc/refman/5.7/en/explain-output.html

Using index - The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.

Upvotes: 1

MatBailie
MatBailie

Reputation: 86798

Hopefully you have a dimension table, where the groupable id is unique?

Then, I'd use a join and a correlated sub-query.

SELECT
  dim.id,
  fact.*
FROM
  dim_groupable    AS dim
LEFT JOIN
  fact_groupable   AS fact
    ON fact.id IN (
      SELECT id
        FROM fact_groupable
       WHERE groupable = dim.id
    ORDER BY orderable
       LIMIT 2
    )

Then make the index cover groupable, orderable, id, so that the correlated sub-query can be anserwed with the index alone.

If you don't have a dimension table just use (SELECT DISTINCT groupable AS id FROM fact_groupable) AS dim. But, you really should have a dimension table.

Upvotes: 1

Related Questions