Why is this count query so slow?

Question

Hi, I'm hosted on Heroku, running PostgreSQL 9.1.6 on their Ika plan (7.5 GB RAM). I have a table called cars. I need to do the following:

SELECT COUNT(*) FROM "cars" WHERE "cars"."reference_id" = 'toyota_hilux'

Now this takes an awful lot of time (64 sec!!!)

Aggregate  (cost=2849.52..2849.52 rows=1 width=0) (actual time=63388.390..63388.391 rows=1 loops=1)
  ->  Bitmap Heap Scan on cars  (cost=24.76..2848.78 rows=1464 width=0) (actual time=1169.581..63387.361 rows=739 loops=1)
        Recheck Cond: ((reference_id)::text = 'toyota_hilux'::text)
        ->  Bitmap Index Scan on index_cars_on_reference_id  (cost=0.00..24.69 rows=1464 width=0) (actual time=547.530..547.530 rows=832 loops=1)
              Index Cond: ((reference_id)::text = 'toyota_hilux'::text)
Total runtime: 64112.412 ms

A little background:

The table holds around 3.2 million rows, and the column I'm trying to count on has the following setup:

reference_id character varying(50);

and index:

CREATE INDEX index_cars_on_reference_id
  ON cars
  USING btree
  (reference_id COLLATE pg_catalog."default" );

What am I doing wrong? I assume this is not the performance I should expect, or is it?

Erwin Brandstetter · Accepted Answer

What @Satya claims in his comment is not quite true. In the presence of a matching index, the planner only chooses a full table scan if table statistics imply it would return more than around 5 % (depends) of the table, because it is then faster to scan the whole table.

As you can see from your own question, this is not the case for your query. It uses a Bitmap Index Scan followed by a Bitmap Heap Scan, though I would have expected a plain index scan.
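To see what the planner bases that decision on, you can look at the column statistics it has collected. A minimal sketch, assuming the table has been analyzed at least once (by autovacuum or manually); table and column names are the ones from the question:

```sql
-- Statistics the planner uses to estimate how many rows match a value.
-- most_common_freqs holds the estimated fraction of the table per common value.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'cars'
AND    attname   = 'reference_id';
```

In the plan from the question the estimate is rows=1464 out of roughly 3.2 million, well under 0.1 % of the table, so an index-based scan is the expected choice.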

I notice two more things in your explain output:
The first scan finds 832 rows, while the second reduces the count to 739. This would indicate that you have many dead tuples in your index.
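One way to cross-check the dead-tuple suspicion is the statistics collector. A small sketch, assuming statistics collection is enabled (the default); the numbers are estimates for the table as a whole, not exact counts for the index:

```sql
-- Estimated live vs. dead rows, and when the table was last vacuumed.
SELECT relname, n_live_tup, n_dead_tup
     , last_vacuum, last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'cars';
```

A high n_dead_tup relative to n_live_tup, or empty vacuum timestamps, would support the theory.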

Check the execution time after each step with EXPLAIN ANALYZE and maybe add the results to your question:

First, rerun the query with EXPLAIN ANALYZE two or three times to populate the cache. What's the result of the last run compared to the first?
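For these reruns it can help to add the BUFFERS option (available since PostgreSQL 9.0), which reports how many pages came from shared buffers and how many had to be read. A sketch using the query from the question:

```sql
-- "shared hit" = page found in PostgreSQL's buffer cache,
-- "read"       = page had to be fetched from the OS (cache or disk).
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   cars
WHERE  reference_id = 'toyota_hilux';
```

If the first run shows mostly reads and later runs show mostly hits with a much lower runtime, the 64 seconds were largely cold-cache I/O.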

VACUUM ANALYZE cars;

Rerun.

If you have lots of write operations on the table, I would set a fill factor lower than 100. Like:

ALTER TABLE cars SET (fillfactor=90);

Lower if your row size is big or you have a lot of write operations. Then:

VACUUM FULL ANALYZE cars;

This will take a while. Rerun.
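To see whether the rewrite actually shrank anything, one option is to record the sizes before and after. A minimal sketch using the built-in size functions, with the table and index names taken from the question:

```sql
-- Run once before and once after VACUUM FULL and compare.
SELECT pg_size_pretty(pg_relation_size('cars'))                       AS table_size
     , pg_size_pretty(pg_relation_size('index_cars_on_reference_id')) AS index_size
     , pg_size_pretty(pg_total_relation_size('cars'))                 AS total_size;
```

total_size includes all indexes and TOAST data of the table.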

Or, if you can afford to do this (and other important queries do not have contradicting requirements):

CLUSTER cars USING index_cars_on_reference_id;

This rewrites the table in the physical order of the index, which should make this kind of query much faster.
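How well the physical row order already matches the index can be read from the planner statistics. A sketch, assuming current statistics; values near 1 or -1 mean rows with equal reference_id sit close together on disk, values near 0 mean they are scattered:

```sql
-- Refresh statistics first (also advisable right after CLUSTER).
ANALYZE cars;

SELECT correlation
FROM   pg_stats
WHERE  tablename = 'cars'
AND    attname   = 'reference_id';
```

Keep in mind that CLUSTER takes an exclusive lock on the table while it runs and is a one-time operation: later writes are not kept in index order, so it may need repeating.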

Normalize schema

If you need this to be really fast, create a table car_type with a serial primary key and reference it from the table cars. This will shrink the necessary index to a fraction of what it is now.

It goes without saying that you should make a backup before you try any of this.

CREATE TABLE car_type (  -- must be a regular table: a foreign key on cars cannot reference a temp table
   car_type_id serial PRIMARY KEY
 , car_type text
 );

INSERT INTO car_type (car_type)
SELECT DISTINCT reference_id FROM cars ORDER BY reference_id; -- old varchar col from the question

ANALYZE car_type;

CREATE UNIQUE INDEX car_type_uni_idx ON car_type (car_type); -- unique types

ALTER TABLE cars RENAME COLUMN reference_id TO car_type; -- rename old col
ALTER TABLE cars ADD COLUMN car_type_id int; -- add new int col

UPDATE cars c
SET car_type_id = ct.car_type_id
FROM car_type ct
WHERE ct.car_type = c.car_type;
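
-- Optional sanity check (an added sketch, not part of the original script):
-- before dropping the old column, confirm every non-null type was mapped.
-- The count should be 0; if it is not, do not drop the column yet.
SELECT count(*) AS unmapped
FROM   cars
WHERE  car_type IS NOT NULL
AND    car_type_id IS NULL;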

ALTER TABLE cars DROP COLUMN car_type; -- drop old varchar col

CREATE INDEX cars_car_type_id_idx ON cars (car_type_id);

ALTER TABLE cars
ADD CONSTRAINT cars_car_type_id_fkey FOREIGN KEY (car_type_id)
REFERENCES car_type (car_type_id) ON UPDATE CASCADE; -- add fk

VACUUM FULL ANALYZE cars;

Or, if you want to go all-out:

CLUSTER cars USING cars_car_type_id_idx;

Your query would now look like this:

SELECT count(*)
FROM   cars
WHERE  car_type_id = (SELECT car_type_id FROM car_type
                      WHERE car_type = 'toyota_hilux')

This should be even faster, mainly because index and table are smaller now, but also because integer handling is faster than varchar handling. The gain will not be dramatic compared to the clustered table on the varchar column, though.

A welcome side effect: if you have to rename a type, it's a tiny UPDATE to one row now, not messing with the big table at all.
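For illustration, such a rename could look like this. The new name 'toyota_hilux_2013' is made up for the example:

```sql
-- Touches a single row in the small lookup table.
-- cars.car_type_id stays the same, so the big table is not rewritten.
UPDATE car_type
SET    car_type = 'toyota_hilux_2013'
WHERE  car_type = 'toyota_hilux';
```

Since only the text label changes and not the primary key, the ON UPDATE CASCADE on the foreign key does not come into play here.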
