Reputation: 3467
I encountered a strange behaviour of the Postgres optimizer on the following query:
select count(product0_.id) as col_0_0_ from Product product0_
where product0_.active=true
and (product0_.aggregatorId is null
or product0_.aggregatorId in ($1 , $2 , $3))
Product
has about 54 columns, active
is a boolean having a btree index, and aggregatorId
is 'varchar(15)` and has a btree index.
On this query above the index for 'aggregatorId' is not used:
Aggregate (cost=169995.75..169995.76 rows=1 width=32) (actual time=3904.726..3904.727 rows=1 loops=1)
-> Seq Scan on product product0_ (cost=0.00..165510.39 rows=1794146 width=32) (actual time=0.055..2407.195 rows=1851827 loops=1)
Filter: (active AND ((aggregatorid IS NULL) OR ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))))
Rows Removed by Filter: 542146
Total runtime: 3904.925 ms
But if we reduce the query by leaving out the null check for this column, the index gets used:
Aggregate (cost=17600.93..17600.94 rows=1 width=32) (actual time=614.933..614.935 rows=1 loops=1)
-> Index Scan using idx_prod_aggr on product product0_ (cost=0.43..17487.56 rows=45347 width=32) (actual time=19.284..594.509 rows=12099 loops=1)
Index Cond: ((aggregatorid)::text = ANY ('{5109037,5001015,70601}'::text[]))
Filter: active
Rows Removed by Filter: 49130
Total runtime: 150.255 ms
As far as I know a btree index can handle null checks, so I don't understand why the index is not used for the complete query. The product table contains about 2.3 million entries, so it is not very fast.
EDIT: The index is very standard:
CREATE INDEX idx_prod_aggr
ON product
USING btree
(aggregatorid COLLATE pg_catalog."default");
Upvotes: 3
Views: 146
Reputation: 11619
Your problem looked interesting, so I reproduced your scenario - postgres 9.1, table with 1M rows, one boolean column, one varchar column, both indexed, half of table has NULL names.
I had same explain analyze output when varchar column was not indexed. However, with index postgres uses bitmap scan on NULL condition and IN condition and then merges them with OR condition.
Then he uses seq scan on boolean condition (because indexes are separated)
explain analyze
select * from A where active is true and ((name is null) OR (name in ('1','2','3') ));
See output:
"Bitmap Heap Scan on a (cost=17.34..21.35 rows=1 width=18) (actual time=0.048..0.048 rows=0 loops=1)"
" Recheck Cond: ((name IS NULL) OR ((name)::text = ANY ('{1,2,3}'::text[])))"
" Filter: (active IS TRUE)"
" -> BitmapOr (cost=17.34..17.34 rows=1 width=0) (actual time=0.047..0.047 rows=0 loops=1)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..4.41 rows=1 width=0) (actual time=0.010..0.010 rows=0 loops=1)"
" Index Cond: (name IS NULL)"
" -> Bitmap Index Scan on idx_prod_aggr (cost=0.00..12.93 rows=1 width=0) (actual time=0.036..0.036 rows=0 loops=1)"
" Index Cond: ((name)::text = ANY ('{1,2,3}'::text[]))"
"Total runtime: 0.077 ms"
This makes me think that you missed some details, if so, add them to your question.
Upvotes: 1
Reputation: 23562
Since there are many identical values for the column which you use in the where clause (78% of all the table rows according to your numbers), the database will conclude that it is cheaper to use full table scan than to waste additional time to read the index.
The rule of thumb in most database vendors is that index will probably not be used if it can't narrow the search down to about 5% of all the table records.
Upvotes: 1