Reputation: 20560
I've embarked on attempting to optimize a rather large query that has 3 nested subqueries (like Russian dolls). The query itself is generated by South
from a Django project, and I freely admit I'm no expert at SQL optimization. My strategy thus far is to start with the innermost query and work my way outwards.
Thus, the first and innermost query is:
SELECT
DISTINCT ON (quote_id) quote_id,
MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id, created_at
ORDER BY quote_id, created_at DESC;
The EXPLAIN ANALYZE for the above is:
Unique (cost=535905.10..610867.38 rows=3331657 width=12) (actual time=4364.469..7587.242 rows=1462625 loops=1)
-> GroupAggregate (cost=535905.10..602538.24 rows=3331657 width=12) (actual time=4364.467..6996.550 rows=3331656 loops=1)
Group Key: quote_id, created_at
-> Sort (cost=535905.10..544234.24 rows=3331657 width=12) (actual time=4364.460..5574.351 rows=3331657 loops=1)
Sort Key: quote_id, created_at
Sort Method: external merge Disk: 84648kB
-> Seq Scan on billing_pricequotestatus (cost=0.00..61080.57 rows=3331657 width=12) (actual time=0.013..854.722 rows=3331657 loops=1)
Planning time: 0.107 ms
Execution time: 7759.317 ms
(9 rows)
Table structure is
Table "public.billing_pricequotestatus"
Column | Type | Modifiers
------------+--------------------------+-----------------------------------------------------------------------
id | integer | not null default nextval('billing_pricequotestatus_id_seq'::regclass)
created_at | timestamp with time zone | not null
updated_at | timestamp with time zone | not null
notes | text | not null
name | character varying(20) | not null
quote_id | integer | not null
Indexes:
"billing_pricequotestatus_pkey" PRIMARY KEY, btree (id)
"billing_pricequotestatus_quote_id" btree (quote_id)
"status_timestamp_idx" btree (quote_id, created_at)
Foreign-key constraints:
"quote_id_refs_id_2b0d5331de8d31b7" FOREIGN KEY (quote_id) REFERENCES billing_pricequote(id) DEFERRABLE INITIALLY DEFERRED
I've tried http://explain.depesz.com/, but I'm not entirely sure I know how to derive next steps from the report. I've also found an article suggesting that the ORDER BY clause could be removed if the SELECT will return the rows in order anyway, which I think might be the case here? Unsure how to tell that.
If I remove the ORDER BY clause, that shaves off ~3410 ms, but I feel this should be faster (if I only do a straight SELECT with no aggregate function, DISTINCT, or ordering, my baseline time appears to be 832.427 ms). I've seen several other SO posts about tables 10x the size of mine getting 3-5x better performance with the right indexes. I know it's not always an apples-to-apples comparison, but I'm hoping for some insight anyway.
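For reference, the baseline number above comes from timing something like this plain scan (no aggregate, DISTINCT, or ORDER BY; the column list is just illustrative):
-- Roughly the baseline I timed at ~832 ms: a plain scan of the table.
SELECT quote_id, created_at
FROM billing_pricequotestatus;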
Upvotes: 2
Views: 84
Reputation: 15624
It looks like a mix of two different solutions for the same thing: getting the max created_at for each distinct quote_id.
1)
SELECT
quote_id,
MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id
2)
SELECT
DISTINCT ON (quote_id) quote_id,
created_at
FROM billing_pricequotestatus
ORDER BY quote_id, created_at DESC
Probably something is wrong with the query generator.
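To confirm the two forms really agree on your data, a quick sanity check is to diff them with EXCEPT (a sketch; it should return zero rows if they are equivalent):
-- Should return no rows if the GROUP BY form and the DISTINCT ON form
-- pick the same (quote_id, max created_at) pairs.
SELECT quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id
EXCEPT
(SELECT DISTINCT ON (quote_id) quote_id, created_at
 FROM billing_pricequotestatus
 ORDER BY quote_id, created_at DESC);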
Upvotes: 2
Reputation: 60472
That's confusing: you create a distinct list of quote_id, created_at and then apply MAX(created_at) plus DISTINCT ON (quote_id)?
This should return the same result:
SELECT
quote_id,
MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id
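It's worth running EXPLAIN ANALYZE on the simplified query against your data; whether the existing status_timestamp_idx on (quote_id, created_at) gets used for the aggregate depends on the planner and your data distribution (a sketch):
-- Compare this plan and timing with the original query's ~7.7 s.
EXPLAIN ANALYZE
SELECT quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
GROUP BY quote_id;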
Upvotes: 2