Why is Postgres not using index on a simple GROUP BY?

Question

I have created a 36M rows table with an index on type column:

CREATE TABLE items AS
  SELECT
    (random()*36000000)::integer AS id,
    (random()*10000)::integer AS type,
    md5(random()::text) AS s
  FROM
    generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");

I run this simple query and expect postgresql to use my index:

explain select count(*) from "items" group by "type";

But the query planner decides to use Seq Scan instead:

HashAggregate  (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
  Group Key: type
  ->  Seq Scan on items  (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms

Time without EXPLAIN: 5s 979ms

I have tried several solutions from here and here:

Run VACUUM ANALYZE or VACUUM ANALYZE
Configure default_statistics_target, random_page_cost, work_mem

but nothing helps apart from setting enable_seqscan = OFF:

SET enable_seqscan = OFF;
explain select count(*) from "items" group by "type";

GroupAggregate  (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
  Group Key: type
  ->  Index Only Scan using items_type_idx on items  (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
        Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms

Time without EXPLAIN: 659ms

Query with index scan is about 10x faster on my machine.

Is there a better solution than setting enable_seqscan?

UPD1

My postgresql version is 9.6.3, work_mem = 4MB (tried 64MB), random_page_cost = 4 (tried 1.1), max_parallel_workers_per_gather = 0 (tried 4).

UPD2

I have tried to fill type column not with random numbers, but with i / 10000 to make pg_stats.correlation = 1 - still seqscan.

UPD3

@jgh is 100% right:

This typically only happens when the table's row width is much wider than some indexes

I've made large column data and now postgres use index. Thanks everyone!

JGH · Accepted Answer

The Index-only scans wiki says

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.

and

Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle

Accordingly, your index is not considered "significantly smaller" and the entire dataset is to be read, which leads the planner in using a seq scan

Why is Postgres not using index on a simple GROUP BY?

Answers (1)

Related Questions