ccleve
ccleve

Reputation: 15809

Why is Postgres index so big and slow?

I have a table of names and addresses with 8 million records in Postgres 13. I'd like to do a fairly simple aggregation on it:

SELECT zipcode, count(*)
FROM mytable
GROUP BY zipcode

zipcode is a varchar(10). I have an index on the table:

CREATE INDEX zip_idx ON mytable (zipcode)

The query takes 4 seconds (update: now 1.2 seconds per suggestions in the comments below) and returns 1,400 rows. Removing count(*) does not make a difference.

Why is this so slow? One would think that iterating over only 1,400 entries would take only a few milliseconds.

The index itself is strangely large:

select pg_relation_size('zip_idx');
=> 119,455,744

So, 119 mb / 8 million rows = ~15 bytes per row. This seems like a lot. This is Postgres 13, which implements btree deduplication. The documentation says that deduplication is switched on by default. Storing a single tid shouldn't take 15 bytes, and the system should be smart enough to hit just those 1,400 entries and not process the entire list of tids.

Is there a way to dump the actual contents of the index to see what's going on?

Am I doing something wrong?

Update:

Here is the explain output. (I'm using actual column names here, not the simplified ones above).

explain (analyze, buffers, format text) select home_zip, count(*)
from contacts_27
group by home_zip

"Finalize GroupAggregate  (cost=1000.46..123094.85 rows=1220 width=13) (actual time=15.315..1529.875 rows=1438 loops=1)"
"  Group Key: home_zip"
"  Buffers: shared hit=35778"
"  ->  Gather Merge  (cost=1000.46..123070.45 rows=2440 width=13) (actual time=15.293..1526.600 rows=2871 loops=1)"
"        Workers Planned: 2"
"        Workers Launched: 2"
"        Buffers: shared hit=35778"
"        ->  Partial GroupAggregate  (cost=0.43..121788.79 rows=1220 width=13) (actual time=2.293..1239.025 rows=957 loops=3)"
"              Group Key: home_zip"
"              Buffers: shared hit=35778"
"              ->  Parallel Index Only Scan using contacts_27_home_zip_idx on contacts_27  (cost=0.43..104488.41 rows=3457635 width=5) (actual time=0.061..548.798 rows=2778682 loops=3)"
"                    Heap Fetches: 0"
"                    Buffers: shared hit=35778"
"Planning Time: 0.120 ms"
"Execution Time: 1530.303 ms"

Upvotes: 0

Views: 1421

Answers (1)

Laurenz Albe
Laurenz Albe

Reputation: 247535

PostgreSQL has to scan the whole table to get all groups, so you cannot have it faster. There is no "index skip scan" in PostgreSQL yet.

Upvotes: 3

Related Questions