Yves V.

Reputation: 773

Why does the order of columns in an index matter for a GROUP BY in PostgreSQL?

I have a relatively large table (about a million records) with, among others, the following columns: account, group, classification and size.

The account is a UUID in practice, but that doesn't really matter here, I think. If I execute the following simple query, it takes about 16 seconds on my machine:
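
A simplified sketch of the table definition (the column names match the query below; apart from account being a UUID, the types shown here are just placeholders, and the real table has further columns):

CREATE TABLE mytable (
    account        uuid,    -- a UUID in practice, as noted above
    "group"        uuid,    -- GROUP is a reserved word; a column really named like this needs quoting
    classification uuid,    -- type assumed
    size           bigint   -- type assumed
    -- ... plus whatever other columns the real table has
);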

select account, group, classification, max(size) 
from mytable 
group by account, group, classification

So far so good. Suppose I add an index:

CREATE INDEX concurrently ON mytable (account, group, classification);

If I execute the same query again, it now gives me back a result in less than half a second. Explaining the query also clearly shows that the index is used.

However, if I reword the query to

select account, group, classification, max(size) 
from mytable 
group by account, classification, group

It takes 16 seconds again and the index is no longer used. In my opinion, the order of the group-by criteria shouldn't matter, but I'm not an expert. Any idea why PostgreSQL can't (or doesn't) optimize the latter query? I tried this on PostgreSQL 9.4.

Edit: On request, here is the EXPLAIN ANALYZE output. For the indexed call:

Group  (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
  Group Key: group_id, classification_id, account_id
  ->  Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable  (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
        Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms

For the call with the order of the GROUP BY criteria changed:

Group  (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
  Group Key: group_id, account_id, classification_id
  ->  Sort  (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
        Sort Key: group_id, account_id, classification_id
        Sort Method: external merge  Disk: 81136kB
        ->  Seq Scan on mytable  (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms

Upvotes: 3

Views: 2736

Answers (2)

ABentSpoon

Reputation: 5169

Actually, the order of the columns in the GROUP BY clause does affect the result: without an explicit ORDER BY, the rows typically come back sorted by the GROUP BY columns, in that order. If you add your own ORDER BY, both the result and the index usage are the same regardless of how the GROUP BY is ordered.

To demonstrate:

CREATE TABLE coconuts (
  mass int,
  volume int,
  loveliness int
);

INSERT INTO coconuts (mass, volume, loveliness)
  SELECT (random() * 5)::int
       , (random() * 5)::int
       , (random() * 1000 + 9000)::int
  FROM GENERATE_SERIES(1,10000000);

Note how the order of the columns in the GROUP BY affects the ordering:

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    0 |      1 | 10000
    0 |      2 | 10000
...

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    1 |      0 | 10000
    2 |      0 | 10000
...

And how it affects the query plan:

CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan = false; -- to force the index if possible

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (mass, volume);
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)


EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (volume, mass);
                                            QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001658532.83..10001758932.83 rows=40000 width=12)
   Group Key: volume, mass
   ->  Sort  (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
         Sort Key: volume, mass
         ->  Seq Scan on coconuts  (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)

However, if you add an ORDER BY that matches the original GROUP BY (and thus the index), the original query plan returns, at least in PostgreSQL 11.5.

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY volume, mass
  ORDER BY mass, volume;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
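
Applied to the query from the question, that means keeping the rewritten GROUP BY order but adding an ORDER BY that matches the index. A sketch, assuming the table and index from the question and a new enough PostgreSQL version (the demonstration above used 11.5):

select account, group, classification, max(size)
from mytable
group by account, classification, group
order by account, group, classification  -- matches the (account, group, classification) index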

Upvotes: 3

Laurenz Albe

Reputation: 247330

You are right that the result is the same no matter in which order the columns appear in the GROUP BY clause, and that the same execution plan could be used.

The PostgreSQL optimizer just doesn't consider reordering the GROUP BY expressions to see if a different ordering would match an existing index.

This is a limitation; you could ask on the pgsql-hackers list whether an enhancement here would be desirable, and back that up with a patch that implements the desired functionality.

However, I am not certain that such an enhancement would be accepted. The downside would be that the optimizer has to do more work, which would increase planning time for every query that uses a GROUP BY clause. In addition, the limitation is easy to work around: just rewrite the query and change the order of the GROUP BY expressions. So I would say that things should be left the way they are now.

Upvotes: 2
