Reputation: 773
I have a relatively large table (about a million records) with the following columns:
The account is a UUID in practice, but I don't think that matters here. If I execute the following simple query, it takes about 16 seconds on my machine:
select account, group, classification, max(size)
from mytable
group by account, group, classification
So far so good. Suppose I add an index:
CREATE INDEX concurrently ON mytable (account, group, classification);
If I execute the same query again, it now gives me back a result in less than half a second. Explaining the query also clearly shows that the index is used.
However, if I reword the query to
select account, group, classification, max(size)
from mytable
group by account, classification, group
It takes 16 seconds again and the index is no longer used. In my opinion, the order of the GROUP BY criteria shouldn't matter, but I'm not an expert. Any idea why PostgreSQL can't (or doesn't) optimize the latter query? I tried this on PostgreSQL 9.4.
Edit: as requested, here is the EXPLAIN output. For the indexed call:
Group (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
Group Key: group_id, classification_id, account_id
-> Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms
For the call with the order of the GROUP BY criteria changed:
Group (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
Group Key: group_id, account_id, classification_id
-> Sort (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
Sort Key: group_id, account_id, classification_id
Sort Method: external merge Disk: 81136kB
-> Seq Scan on mytable (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms
Upvotes: 3
Views: 2736
Reputation: 5169
Actually, the order of the columns in the GROUP BY clause does affect the result. By default, the rows usually come back sorted by the GROUP BY columns (a side effect of the sort or index plan chosen, not a guarantee). If you set your own ORDER BY, the result and the index usage will be the same.
To demonstrate:
CREATE TABLE coconuts (
mass int,
volume int,
loveliness int
);
INSERT INTO coconuts (mass, volume, loveliness)
SELECT (random() * 5)::int
, (random() * 5)::int
, (random() * 1000 + 9000)::int
FROM GENERATE_SERIES(1,10000000);
Note how the order of the columns in the GROUP BY affects the ordering:
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;
mass | volume | max
------+--------+-------
0 | 0 | 10000
0 | 1 | 10000
0 | 2 | 10000
...
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;
mass | volume | max
------+--------+-------
0 | 0 | 10000
1 | 0 | 10000
2 | 0 | 10000
...
And how it affects the query plan:
CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan=false; -- to force index usage if possible
EXPLAIN
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY (mass, volume);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=1000.46..460459.11 rows=40000 width=12)
Group Key: mass, volume
-> Gather Merge (cost=1000.46..459459.11 rows=80000 width=12)
Workers Planned: 2
-> Partial GroupAggregate (cost=0.43..449225.10 rows=40000 width=12)
Group Key: mass, volume
-> Parallel Index Scan using coconuts_mass_volume_idx on coconuts (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
EXPLAIN
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY (volume, mass);
QUERY PLAN
------------------------------------------------------------------------------------------------
GroupAggregate (cost=10001658532.83..10001758932.83 rows=40000 width=12)
Group Key: volume, mass
-> Sort (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
Sort Key: volume, mass
-> Seq Scan on coconuts (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)
However, if you match your ORDER BY to the original GROUP BY, the original query plan returns, at least on PostgreSQL 11.5.
EXPLAIN
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass
ORDER BY mass, volume;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=1000.46..460459.11 rows=40000 width=12)
Group Key: mass, volume
-> Gather Merge (cost=1000.46..459459.11 rows=80000 width=12)
Workers Planned: 2
-> Partial GroupAggregate (cost=0.43..449225.10 rows=40000 width=12)
Group Key: mass, volume
-> Parallel Index Scan using coconuts_mass_volume_idx on coconuts (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
Upvotes: 3
Reputation: 247330
You are right that the result is the same no matter in which order the columns appear in the GROUP BY clause, and that the same execution plan could be used.
The PostgreSQL optimizer just doesn't consider reordering the GROUP BY expressions to see if a different ordering would match an existing index.
This is a limitation, and you can ask on the pgsql-hackers list whether an enhancement here would be desirable. You could back that up with a patch that implements the desired functionality.
However, I am not certain that such an enhancement would be accepted. The downside is that the optimizer would have to do more work, which would increase planning time for every query that uses a GROUP BY clause. In addition, it is quite easy to work around the limitation: just rewrite your query and change the order of the GROUP BY expressions. So I would say that things should be left the way they are now.
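Applied to the table from the question (assuming the column names given there), the workaround is simply to list the GROUP BY columns in the same order as the index; the set of result rows is identical either way:

```sql
-- Slow: column order doesn't match the (account, group, classification) index,
-- so the planner sorts the whole table.
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, classification, "group";

-- Fast: same grouping, reordered to match the index, so an index scan is used.
-- ("group" is quoted because it is a reserved word.)
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, "group", classification;
```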
Upvotes: 2