Reputation: 505
I have a large table with BRIN index, and if I do query with limit it ignores the index and go for sequence scan, without it it uses index (I tried it several times with same results)
explain (analyze,verbose,buffers,timing,costs)
select *
from testj.cdc_s5_gpps_ind
where id_transformace = 1293
limit 100
Limit (cost=0.00..349.26 rows=100 width=207) (actual time=28927.179..28927.214 rows=100 loops=1)
Output: id, date_key_trainjr...
Buffers: shared hit=225 read=1680241
-> Seq Scan on testj.cdc_s5_gpps_ind (cost=0.00..3894204.10 rows=1114998 width=207) (actual time=28927.175..28927.202 rows=100 loops=1)
Output: id, date_key_trainjr...
Filter: (cdc_s5_gpps_ind.id_transformace = 1293)
Rows Removed by Filter: 59204140
Buffers: shared hit=225 read=1680241
Planning Time: 0.149 ms
Execution Time: 28927.255 ms
explain (analyze,verbose,buffers,timing,costs)
select *
from testj.cdc_s5_gpps_ind
where id_transformace = 1293
Bitmap Heap Scan on testj.cdc_s5_gpps_ind (cost=324.36..979783.34 rows=1114998 width=207) (actual time=110.103..467.008 rows=1073725 loops=1)
Output: id, date_key_trainjr...
Recheck Cond: (cdc_s5_gpps_ind.id_transformace = 1293)
Rows Removed by Index Recheck: 11663
Heap Blocks: lossy=32000
Buffers: shared hit=32056
-> Bitmap Index Scan on gpps_brin_index (cost=0.00..45.61 rows=1120373 width=0) (actual time=2.326..2.326 rows=320000 loops=1)
Index Cond: (cdc_s5_gpps_ind.id_transformace = 1293)
Buffers: shared hit=56
Planning Time: 1.343 ms
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.540 ms, Inlining 32.246 ms, Optimization 44.423 ms, Emission 22.524 ms, Total 99.732 ms
Execution Time: 537.627 ms
Is there a reason for this behavior?
PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit
Upvotes: 4
Views: 3705
Reputation: 3688
We ran into this problem with a very similar query:
SELECT
"MatchPlayer"."normalizedElo"
FROM
"match_player" "MatchPlayer"
WHERE
"MatchPlayer"."id" = 'abc'
ORDER BY
"MatchPlayer"."matchId" DESC
LIMIT 1
Although this table has an index on id
, it wasn't being used, and this query was very slow. We found two solutions that made it fast on PostgreSQL 14.3:
WITH player_matches AS MATERIALIZED (
SELECT * FROM match_player
WHERE id = 'abc'
)
SELECT "normalizedElo" FROM player_matches
ORDER BY "matchId" DESC LIMIT 1;
id, "matchId" DESC
and then run ANALYZE match_player
. For some reason, it only began using the index after running an ANALYZE
. We don't understand why that would be the case, but it worked.Solution #1 has worse read performance (since it looks at all rows that match the id
) while solution #2 requires more memory/storage and has a cost for INSERTs (since it involves adding a new index). Ultimately we went with #2. Actually #2 was what we tried first, except we missed the ANALYZE
part initially.
Upvotes: 1
Reputation: 44192
There is a very simple (which is not to say good) reason for this. The planner assumes rows with id_transformace = 1293 are evenly distributed throughout the table, and so it will be able to collect 100 of them very quickly with a seq scan and then stop early. But this assumption is very wrong, and needs to go through a big chunk of the table to find 100 qualifying rows.
This assumption is not based on any statistics gathered on the table, so increasing the statistics target will not help. And extended statistics will not help either, as it only offers statistic between columns, not between a column and the physical ordering.
There are no good clean ways to solve this purely on the stock server side. One work-around is to set enable_seqscan=off
before running the query, then reset afterwords. Another would be to add ORDER BY random()
to your query, that way the planner knows it can't stop early. Or maybe the extension pg_hint_plan could help, I've never used it.
You might get it to change the plan by tweaking your some of your *_cost parameters, but that would likely make other things worse. Seeing the output of the EXPLAIN (ANALYZE, BUFFERS) of the LIMITed query run with enable_seqscan=off could inform that decision.
Upvotes: 8
Reputation: 44250
Since the column appears to be sparse/skew, you could try to increase the statistics size :
ALTER TABLE testj.cdc_s5_gpps_ind
ALTER COLUMN id_transformace SET STATISTICS 1000;
ANALYZE testj.cdc_s5_gpps_ind;
Postgres-11 and above also has extended statistics, allowing multi-column correlations to be recognised and exploited. You must have some understanding of the actual structure of the data in the table to use them effectively.
Upvotes: 0