Reputation: 97
I have to solve a problem for my class about query optimization in PostgreSQL. The task is to optimize the following query:
select ol_number,
       sum(ol_quantity) as sum_qty,
       sum(ol_amount) as sum_amount,
       avg(ol_quantity) as avg_qty,
       avg(ol_amount) as avg_amount,
       count(*) as count_order
from order_line
where ol_delivery_d > '2007-01-02 00:00:00.000000'
group by ol_number
order by ol_number
To me it seems that it is already optimized, but I am not sure about that.
Is it possible to optimize this query through indexes or something else (a materialized view is possible as well)?
Execution plan (also linked in comments)
"Sort (cost=63652.89..63652.92 rows=11 width=13) (actual time=4026.270..4026.272 rows=15 loops=1)" " Sort Key: ol_number" " Sort Method: quicksort Memory: 26kB" " -> HashAggregate (cost=63652.54..63652.70 rows=11 width=13) (actual time=4026.218..4026.247 rows=15 loops=1)" " -> Seq Scan on order_line (cost=0.00..44671.46 rows=1265405 width=13) (actual time=0.008..486.399 rows=1259054 loops=1)" " Filter: (ol_delivery_d > '2007-01-02 00:00:00'::timestamp without time zone)" " Rows Removed by Filter: 538903" "Total runtime: 4026.346 ms"
I came up with the following trick, because I consider that avg is the same thing as sum/count, so I think this rewrite is pertinent, or am I wrong?
select ol_number, sum_qty, sum_amount,
       sum_qty/count_order as avg_qty,
       sum_amount/count_order as avg_amount,
       count_order
from (select ol_number,
             sum(ol_quantity) as sum_qty,
             sum(ol_amount) as sum_amount,
             count(*) as count_order
      from order_line
      where ol_delivery_d > '2007-01-02 00:00:00.000000'
      group by ol_number
      order by ol_number) as t
Upvotes: 0
Views: 126
Reputation: 95592
When I look at an execution plan like this, the thing that jumps out at me is "Seq Scan". Train your eyes to make "Seq Scan" jump out at you.
Index columns used in WHERE clauses
One rule of thumb for table design says, "Index every column used in a WHERE clause." On the one hand, I'd like to see an index on "ol_delivery_d". On the other hand, I don't think the query planner would use it in this particular case.
Your WHERE clause shows you calculating aggregates on 1.3 million rows, and excluding only half a million. I wouldn't be surprised if the query planner chose a sequential scan in this case even if there were an index, and quick-and-dirty tests here show that's the case. (My quick-and-dirty tests could be wrong, though. My data isn't your data; my server isn't your server.)
If there's not an index on "ol_delivery_d", create one. It won't likely help this query, but it will help queries that return a much smaller number of rows.
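If you want to try it, a minimal sketch might look like this (the index name is my own invention; the table and column names come from your plan):

create index order_line_ol_delivery_d_idx on order_line (ol_delivery_d);

-- Re-check whether the planner actually uses the index for this query.
explain analyze
select ol_number, sum(ol_quantity), sum(ol_amount),
       avg(ol_quantity), avg(ol_amount), count(*)
from order_line
where ol_delivery_d > '2007-01-02 00:00:00.000000'
group by ol_number
order by ol_number;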
Use the right data type
You have a column named "ol_delivery_d", which suggests the table designer intended to store some kind of date of delivery. The absence of time data, '00:00:00.000000', also suggests you're querying by date, not by timestamp. But the execution plan shows your value being cast to "timestamp without time zone".
If you're querying only dates, change the data type of this column to "date". The "date" data type is narrower than "timestamp"--more data fits in a page, indexes are smaller, your query should run a little faster.
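As a rough sketch, assuming nothing else in your schema or application depends on the column being a timestamp, the change could look like:

alter table order_line
    alter column ol_delivery_d type date
    using ol_delivery_d::date;

-- With a date column, compare against a plain date literal.
select count(*) from order_line where ol_delivery_d > date '2007-01-02';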
Avoid external sorts
Finally, the sort method for this query is already optimal: the plan shows an in-memory quicksort using only 26kB. In-memory sorts are much faster than external (on-disk) sorts, so there is nothing to change here.
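If you ever see "Sort Method: external merge Disk: ..." in a plan instead, the usual knob is work_mem; as a sketch, with 64MB as an arbitrary example value rather than a recommendation for your server:

-- Raise the per-sort memory budget for this session only, then re-run EXPLAIN ANALYZE.
set work_mem = '64MB';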
Other things you can do
There are a lot of other things you can do to make queries run faster. (See the PostgreSQL wiki.) Doubling the speed of your server will make them faster. Creating new tablespaces on SSD and moving tables and indexes to those new tablespaces will usually help. A materialized view might or might not help; I don't think it will help in this case, but it doesn't hurt to try.
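If you do want to try a materialized view, a minimal sketch might be the following (the view name is my own invention; note that the cutoff date gets baked into the view, so this only helps if that date is fixed):

create materialized view order_line_summary as
select ol_number,
       sum(ol_quantity) as sum_qty,
       sum(ol_amount) as sum_amount,
       avg(ol_quantity) as avg_qty,
       avg(ol_amount) as avg_amount,
       count(*) as count_order
from order_line
where ol_delivery_d > '2007-01-02 00:00:00.000000'
group by ol_number;

-- Read from the precomputed summary, and refresh it when order_line changes.
select * from order_line_summary order by ol_number;
refresh materialized view order_line_summary;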
Upvotes: 0