Reputation: 15809
I have a table of names and addresses with 8 million records in Postgres 13. I'd like to do a fairly simple aggregation on it:
SELECT zipcode, count(*)
FROM mytable
GROUP BY zipcode
zipcode is a varchar(10). I have an index on the table:
CREATE INDEX zip_idx ON mytable (zipcode)
The query takes 4 seconds (update: now 1.2 seconds per suggestions in the comments below) and returns 1,400 rows. Removing count(*)
does not make a difference.
Why is this so slow? One would think that iterating over only 1,400 entries would take only a few milliseconds.
The index itself is strangely large:
select pg_relation_size('zip_idx');
=> 119,455,744
So, 119 mb / 8 million rows = ~15 bytes per row. This seems like a lot. This is Postgres 13, which implements btree deduplication. The documentation says that deduplication is switched on by default. Storing a single tid shouldn't take 15 bytes, and the system should be smart enough to hit just those 1,400 entries and not process the entire list of tids.
Is there a way to dump the actual contents of the index to see what's going on?
Am I doing something wrong?
Update:
Here is the explain output. (I'm using actual column names here, not the simplified ones above).
explain (analyze, buffers, format text) select home_zip, count(*)
from contacts_27
group by home_zip
"Finalize GroupAggregate (cost=1000.46..123094.85 rows=1220 width=13) (actual time=15.315..1529.875 rows=1438 loops=1)"
" Group Key: home_zip"
" Buffers: shared hit=35778"
" -> Gather Merge (cost=1000.46..123070.45 rows=2440 width=13) (actual time=15.293..1526.600 rows=2871 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" Buffers: shared hit=35778"
" -> Partial GroupAggregate (cost=0.43..121788.79 rows=1220 width=13) (actual time=2.293..1239.025 rows=957 loops=3)"
" Group Key: home_zip"
" Buffers: shared hit=35778"
" -> Parallel Index Only Scan using contacts_27_home_zip_idx on contacts_27 (cost=0.43..104488.41 rows=3457635 width=5) (actual time=0.061..548.798 rows=2778682 loops=3)"
" Heap Fetches: 0"
" Buffers: shared hit=35778"
"Planning Time: 0.120 ms"
"Execution Time: 1530.303 ms"
Upvotes: 0
Views: 1421
Reputation: 247535
PostgreSQL has to scan the whole table to get all groups, so you cannot have it faster. There is no "index skip scan" in PostgreSQL yet.
Upvotes: 3