Arkadiy Kukarkin
Arkadiy Kukarkin

Reputation: 2293

How to sort and paginate groups of rows based on first row in the group?

I have items with a timestamp and a foreign key id. I want to group them by the foreign key, sort each group by timestamp taking the first 3 from each group, and also sort all groups by the timestamp of the first item, like this:

+----+-------+-------+-------+
| id | item1 | item2 | item3 |
+----+-------+-------+-------+
| A  |     1 |    13 |    99 |
| B  |    10 |    20 |    21 |
| C  |    50 |    51 |    60 |
| D  |    56 |    70 |    75 |
+----+-------+-------+-------+

I would also like to be able to select ranges based on the ts of the first item (so querying for ts > 5 AND ts < 55 would exclude A and D -- note that C contains a row with ts=60 but I still want to include that because the first element in that group has ts=50)

My approach right now is to find the ids of the first item in each set in a subquery, then select the topN for those ids, which doesn't seem ideal; we end up redoing the same sorts twice.

SELECT *
FROM   (SELECT Row_number()
                 OVER (
                   partition BY things.id
                   ORDER BY links.created_at) AS r2,
               links.*
        FROM   things
               INNER JOIN links
                       ON ( links.b_id = things.id )
        WHERE  b_id IN (SELECT thing_id
                               FROM
               (SELECT Row_number()
                         OVER (
                           partition BY links.b_id
                           ORDER BY links.created_at) AS
                       r,
                       b_id                           AS
                       thing_id,
                       created_at
                FROM   links
                WHERE  links.entity_b_type = 'thing'
                       AND links.user_id =
                           '1234') tmp
                               WHERE  r = 1
                                      AND created_at < some_time)) tmp
WHERE  r2 <= 5;

Can I somehow sort the original results (with r <= 3) without the second select pass?

Upvotes: 0

Views: 176

Answers (1)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 658092

Assuming referential integrity between things and links, the query you display can be simplified to:

SELECT *
FROM  (
   SELECT *, row_number() OVER (PARTITION BY b_id ORDER BY created_at) AS rn
   FROM   links l
   WHERE  EXISTS (
      SELECT 1
      FROM   links l1
      WHERE  l1.b_id = l.bid
      AND    l1.entity_b_type = 'thing'
      AND    l1.user_id = '1234'  -- why quoted? not integer?
      AND    l1.created_at < some_time
      )
   ) l
JOIN   things t ON t.id = l.b_id 
WHERE  l.rn <= 5;

Depending on data distribution, chances are good that a LATERAL solution is even faster:

SELECT *
FROM   things t 
     , LATERAL (
   SELECT *, row_number() OVER (ORDER BY created_at) AS rn  -- optional info
   FROM   links l
   WHERE  l.b_id = t.id
   ORDER  BY created_at
   LIMIT  5
   ) l
WHERE  EXISTS (
   SELECT 1
   FROM   links l
   WHERE  l.b_id = t.id
   AND    l.entity_b_type = 'thing'
   AND    l.user_id = '1234'
   AND    l.created_at < some_time
   );

Detailed explanation (chapter "2a. LATERAL join"):

Key to performance are matching indexes. Indexing always depends on the complete picture, but these would make the query very fast:

CREATE INDEX links_idx1 ON links (user_id, entity_b_type, created_at);
CREATE INDEX links_idx2 ON links (b_id, created_at);

It is suspicious that you first check whether the first links.created_at for the given predicates entity_b_type = 'thing' AND user_id = '1234' is old enough, but then go on using the oldest rows per b_id irregardless of those predicates. If that's a mistake, the query might be simplified further.

Untested. It's hard to say more without basic information.

Upvotes: 1

Related Questions