Yves V.

Reputation: 773

Why does the order of columns in an index matter for a GROUP BY in PostgreSQL?

I have a relatively large table (about a million records) with, among others, the following columns: account, group, classification and size.

The account is a UUID in practice, but that doesn't really matter here, I think. If I execute the following simple query, it takes about 16 seconds on my machine:
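
A simplified sketch of the table definition (the column names match the query below; apart from account being a UUID, the types shown here are just placeholders, and the real table has further columns):

CREATE TABLE mytable (
    account        uuid,    -- a UUID in practice, as noted above
    "group"        uuid,    -- GROUP is a reserved word; a column really named like this needs quoting
    classification uuid,    -- type assumed
    size           bigint   -- type assumed
    -- ... plus whatever other columns the real table has
);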

select account, group, classification, max(size) 
from mytable 
group by account, group, classification

So far so good. Suppose I add an index:

CREATE INDEX concurrently ON mytable (account, group, classification);

If I execute the same query again, it now gives me back a result in less than half a second. Explaining the query also clearly shows that the index is used.

However, if I reword the query to

select account, group, classification, max(size) 
from mytable 
group by account, classification, group

It takes 16 seconds again and the index is no longer used. In my opinion, the order of the group-by criteria shouldn't matter, but I'm not an expert. Any idea why PostgreSQL can't (or doesn't) optimize the latter query? I tried this on PostgreSQL 9.4.

Edit: On request, here is the EXPLAIN ANALYZE output. For the indexed call:

Group  (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
  Group Key: group_id, classification_id, account_id
  ->  Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable  (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
        Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms

For the call with the order of the GROUP BY criteria changed:

Group  (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
  Group Key: group_id, account_id, classification_id
  ->  Sort  (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
        Sort Key: group_id, account_id, classification_id
        Sort Method: external merge  Disk: 81136kB
        ->  Seq Scan on mytable  (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms

Upvotes: 3

Views: 2736

Answers (2)

ABentSpoon

Reputation: 5169

Actually, the order of the columns in the GROUP BY clause does affect the result: without an explicit ORDER BY, the rows typically come back sorted by the GROUP BY columns, in that order. If you add your own ORDER BY, both the result and the index usage are the same regardless of how the GROUP BY is ordered.

To demonstrate:

CREATE TABLE coconuts (
  mass int,
  volume int,
  loveliness int
);

INSERT INTO coconuts (mass, volume, loveliness)
  SELECT (random() * 5)::int
       , (random() * 5)::int
       , (random() * 1000 + 9000)::int
  FROM GENERATE_SERIES(1,10000000);

Note how the order of the columns in the GROUP BY affects the ordering:

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    0 |      1 | 10000
    0 |      2 | 10000
...

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    1 |      0 | 10000
    2 |      0 | 10000
...

And how it affects the query plan:

CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan = false; -- to force the index if possible

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (mass, volume);
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)


EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (volume, mass);
                                            QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001658532.83..10001758932.83 rows=40000 width=12)
   Group Key: volume, mass
   ->  Sort  (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
         Sort Key: volume, mass
         ->  Seq Scan on coconuts  (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)

However, if you add an ORDER BY that matches the original GROUP BY (and thus the index), the original query plan returns, at least in PostgreSQL 11.5.

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY volume, mass
  ORDER BY mass, volume;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
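
Applied to the query from the question, that means keeping the rewritten GROUP BY order but adding an ORDER BY that matches the index. A sketch, assuming the table and index from the question and a new enough PostgreSQL version (the demonstration above used 11.5):

select account, group, classification, max(size)
from mytable
group by account, classification, group
order by account, group, classification  -- matches the (account, group, classification) index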

Upvotes: 3

Laurenz Albe

Reputation: 247330

You are right that the result is the same no matter in which order the columns appear in the GROUP BY clause, and that the same execution plan could be used.

The PostgreSQL optimizer just doesn't consider reordering the GROUP BY expressions to see if a different ordering would match an existing index.

This is a limitation; you could ask on the pgsql-hackers list whether an enhancement here would be desirable, and back that up with a patch that implements the desired functionality.

However, I am not certain that such an enhancement would be accepted. The downside would be that the optimizer has to do more work, which would increase planning time for every query that uses a GROUP BY clause. In addition, the limitation is easy to work around: just rewrite the query and change the order of the GROUP BY expressions. So I would say that things should be left the way they are now.

Upvotes: 2
