Reputation: 49
i'm a little confused here.
Here is my (simplified) query:
SELECT *
from (SELECT documents.*,
(SELECT max(date)
FROM registrations
WHERE registrations.document_id = documents.id) AS register_date
FROM documents) AS dcmnts
ORDER BY register_date
LIMIT 20;
And here is my EXPLAIN ANALYSE
results:
Limit (cost=46697025.51..46697025.56 rows=20 width=193) (actual time=80329.201..80329.206 rows=20 loops=1)
-> Sort (cost=46697025.51..46724804.61 rows=11111641 width=193) (actual time=80329.199..80329.202 rows=20 loops=1)
Sort Key: ((SubPlan 1))
Sort Method: top-N heapsort Memory: 29kB
-> Seq Scan on documents (cost=0.00..46401348.74 rows=11111641 width=193) (actual time=0.061..73275.304 rows=11114254 loops=1)
SubPlan 1
-> Aggregate (cost=3.95..4.05 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=11114254)
-> Index Scan using registrations_document_id_index on registrations (cost=0.43..3.95 rows=2 width=4) (actual time=0.004..0.004 rows=1 loops=11114254)
Index Cond: (document_id = documents.id)
Planning Time: 0.334 ms
Execution Time: 80329.287 ms
The query takes 1m 20s
to execute, is there any way to optimize it? There are lots of rows in these tables (documents:11114642;registrations:13176070
).
In actual full query I also have some more filters and it takes up to 4 seconds to execute, and it's still too slow. This subquery orderby seems to be the bottleneck here and I can't figure out the way to optimize it.
I tried to set indexes on date/document_id columns.
Upvotes: 0
Views: 709
Reputation: 44177
In actual full query I also have some more filters and it takes up to 4 seconds to execute, and it's still too slow.
Then ask about that query. What can we say about a query we can't see? Clearly this other query isn't just like this query, except for filtering stuff out after all the work is done, as then it couldn't be faster (other than due to cache hotness) than the one you showed us. It is doing something different, it has to optimized differently.
This subquery orderby seems to be the bottleneck here and I can't figure out the way to optimize it.
The timing for the sort node includes the time if all the work that preceded it, so the time of the actual sort is 80329.206 - 73275.304 = 7 seconds, a long time perhaps but a minority of the total time. (This interpretation is not very obvious from the output itself--it from experience.)
For the query you did show us, you can get it to be quite fast, but only probabilistically correct, by using a rather convoluted construction.
with t as (select date, document_id from registrations
order by date desc, document_id desc limit 200),
t2 as (select distinct on (document_id) document_id, date from t
order by document_id, date desc),
t3 as ( select document_id, date from t2 order by date desc limit 20)
SELECT documents.*,
t3.date as register_date
FROM documents join t3 on t3.document_id = documents.id;
order by register_date
It will be efficiently supported by:
create index on registrations (register_date, document_id);
create index on documents(id);
The idea here is that the 200 most recent registrations will have at least 20 distinct document_id among them. Of course there is no way to know for sure that that will be true, so you might have to increase 200 to 20000 (which should still be pretty fast, compared to what you are currently doing) or even more to be sure you get the right answer. This also assumes that every distinct document_id matches exactly one document.id.
Upvotes: 0
Reputation:
Don't use a scalar subquery:
SELECT documents.*,
reg.register_date
FROM documents
JOIN (
SELECT document_id, max(date) as register_date
FROM registrations
GROUP BY document_id
) reg on reg.document_id = documents.id;
ORDER BY register_date
LIMIT 20;
Upvotes: 2
Reputation: 10701
Try to unnest the query
SELECT documents.id,
documents.other_attr,
max(registrations.date) register_date
FROM documents
JOIN registrations ON registrations.document_id = documents.id
GROUP BY documents.id, documents.other_attr
ORDER BY 2
LIMIT 20
The query should be supported at least by an index on registrations(document_id, date)
:
create index idx_registrations_did_date
on registrations(document_id, date)
Upvotes: 0