Reputation: 2293
I have items with a timestamp and a foreign key id. I want to group them by the foreign key, sort each group by timestamp taking the first 3 from each group, and also sort all groups by the timestamp of the first item, like this:
+----+-------+-------+-------+
| id | item1 | item2 | item3 |
+----+-------+-------+-------+
| A | 1 | 13 | 99 |
| B | 10 | 20 | 21 |
| C | 50 | 51 | 60 |
| D | 56 | 70 | 75 |
+----+-------+-------+-------+
I would also like to be able to select ranges based on the ts of the first item (so querying for ts > 5 AND ts < 55
would exclude A and D -- note that C contains a row with ts=60 but I still want to include that because the first element in that group has ts=50)
My approach right now is to find the ids of the first item in each set in a subquery, then select the topN for those ids, which doesn't seem ideal; we end up redoing the same sorts twice.
SELECT *
FROM (SELECT Row_number()
OVER (
partition BY things.id
ORDER BY links.created_at) AS r2,
links.*
FROM things
INNER JOIN links
ON ( links.b_id = things.id )
WHERE b_id IN (SELECT thing_id
FROM
(SELECT Row_number()
OVER (
partition BY links.b_id
ORDER BY links.created_at) AS
r,
b_id AS
thing_id,
created_at
FROM links
WHERE links.entity_b_type = 'thing'
AND links.user_id =
'1234') tmp
WHERE r = 1
AND created_at < some_time)) tmp
WHERE r2 <= 5;
Can I somehow sort the original results (with r <= 3) without the second select pass?
Upvotes: 0
Views: 176
Reputation: 658092
Assuming referential integrity between things and links, the query you display can be simplified to:
SELECT *
FROM (
SELECT *, row_number() OVER (PARTITION BY b_id ORDER BY created_at) AS rn
FROM links l
WHERE EXISTS (
SELECT 1
FROM links l1
WHERE l1.b_id = l.bid
AND l1.entity_b_type = 'thing'
AND l1.user_id = '1234' -- why quoted? not integer?
AND l1.created_at < some_time
)
) l
JOIN things t ON t.id = l.b_id
WHERE l.rn <= 5;
Depending on data distribution, chances are good that a LATERAL
solution is even faster:
SELECT *
FROM things t
, LATERAL (
SELECT *, row_number() OVER (ORDER BY created_at) AS rn -- optional info
FROM links l
WHERE l.b_id = t.id
ORDER BY created_at
LIMIT 5
) l
WHERE EXISTS (
SELECT 1
FROM links l
WHERE l.b_id = t.id
AND l.entity_b_type = 'thing'
AND l.user_id = '1234'
AND l.created_at < some_time
);
Detailed explanation (chapter "2a. LATERAL
join"):
Key to performance are matching indexes. Indexing always depends on the complete picture, but these would make the query very fast:
CREATE INDEX links_idx1 ON links (user_id, entity_b_type, created_at);
CREATE INDEX links_idx2 ON links (b_id, created_at);
It is suspicious that you first check whether the first links.created_at
for the given predicates entity_b_type = 'thing' AND user_id = '1234'
is old enough, but then go on using the oldest rows per b_id
irregardless of those predicates. If that's a mistake, the query might be simplified further.
Untested. It's hard to say more without basic information.
Upvotes: 1