Baker

Reputation: 505

PostgreSQL: Limit in query makes it not use index

I have a large table with a BRIN index. If I run a query with LIMIT, it ignores the index and goes for a sequential scan; without the LIMIT it uses the index (I tried it several times with the same results):

explain (analyze,verbose,buffers,timing,costs) 
select *
from testj.cdc_s5_gpps_ind
where id_transformace = 1293
limit 100

Limit  (cost=0.00..349.26 rows=100 width=207) (actual time=28927.179..28927.214 rows=100 loops=1)
  Output: id, date_key_trainjr...
  Buffers: shared hit=225 read=1680241
  ->  Seq Scan on testj.cdc_s5_gpps_ind  (cost=0.00..3894204.10 rows=1114998 width=207) (actual time=28927.175..28927.202 rows=100 loops=1)
        Output: id, date_key_trainjr...
        Filter: (cdc_s5_gpps_ind.id_transformace = 1293)
        Rows Removed by Filter: 59204140
        Buffers: shared hit=225 read=1680241
Planning Time: 0.149 ms
Execution Time: 28927.255 ms

explain (analyze,verbose,buffers,timing,costs) 
select *
from testj.cdc_s5_gpps_ind
where id_transformace = 1293

Bitmap Heap Scan on testj.cdc_s5_gpps_ind  (cost=324.36..979783.34 rows=1114998 width=207) (actual time=110.103..467.008 rows=1073725 loops=1)
  Output: id, date_key_trainjr...
  Recheck Cond: (cdc_s5_gpps_ind.id_transformace = 1293)
  Rows Removed by Index Recheck: 11663
  Heap Blocks: lossy=32000
  Buffers: shared hit=32056
  ->  Bitmap Index Scan on gpps_brin_index  (cost=0.00..45.61 rows=1120373 width=0) (actual time=2.326..2.326 rows=320000 loops=1)
        Index Cond: (cdc_s5_gpps_ind.id_transformace = 1293)
        Buffers: shared hit=56
Planning Time: 1.343 ms
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 0.540 ms, Inlining 32.246 ms, Optimization 44.423 ms, Emission 22.524 ms, Total 99.732 ms
Execution Time: 537.627 ms

Is there a reason for this behavior?

PostgreSQL 12.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), 64-bit

Upvotes: 4

Views: 3705

Answers (3)

Aurast

Reputation: 3688

We ran into this problem with a very similar query:

SELECT
    "MatchPlayer"."normalizedElo"
FROM
    "match_player" "MatchPlayer"
WHERE
    "MatchPlayer"."id" = 'abc'
ORDER BY
    "MatchPlayer"."matchId" DESC
LIMIT 1

Although this table has an index on id, it wasn't being used, and this query was very slow. We found two solutions that made it fast on PostgreSQL 14.3:

  1. Use a materialized CTE:
WITH player_matches AS MATERIALIZED (
  SELECT * FROM match_player
  WHERE id = 'abc'
)
SELECT "normalizedElo" FROM player_matches
ORDER BY "matchId" DESC LIMIT 1;
  2. Add an index on (id, "matchId" DESC) and then run ANALYZE match_player. For some reason, it only began using the index after running an ANALYZE. We don't understand why that would be the case, but it worked.

Solution #1 has worse read performance (since it reads all rows matching the id), while solution #2 requires more memory/storage and adds a cost to INSERTs (since it involves adding a new index). Ultimately we went with #2. In fact #2 was what we tried first, except we initially missed the ANALYZE step.
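For reference, solution #2 was roughly the following (table and column names as in the query above; the index name is our own choice):

```sql
-- Composite index matching the WHERE + ORDER BY of the query, so a
-- single index scan can return the newest matching row directly.
CREATE INDEX match_player_id_match_id_idx
    ON match_player (id, "matchId" DESC);

-- In our case the planner only picked the new index after
-- fresh statistics were gathered.
ANALYZE match_player;
```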

Upvotes: 1

jjanes

Reputation: 44192

There is a very simple (which is not to say good) reason for this. The planner assumes rows with id_transformace = 1293 are evenly distributed throughout the table, so it expects to collect 100 of them very quickly with a seq scan and then stop early; with roughly 1.1 million estimated matches among ~59 million rows, it expects to hit a match about every 53 rows. But this assumption is very wrong here, and the scan has to go through a big chunk of the table to find 100 qualifying rows.

This assumption is not based on any statistics gathered on the table, so increasing the statistics target will not help. Extended statistics will not help either, as they only capture correlations between columns, not between a column and the physical row ordering.

There are no good clean ways to solve this purely on the stock server side. One workaround is to set enable_seqscan=off before running the query, then reset it afterwards. Another is to add ORDER BY random() to the query, so the planner knows it can't stop early. Or maybe the extension pg_hint_plan could help; I've never used it.
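Both workarounds might look like this against the table from the question (a sketch, not tested on your data):

```sql
-- Workaround 1: disable seq scans for just this query.
-- SET LOCAL confines the change to the transaction, so it
-- reverts automatically at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL enable_seqscan = off;
SELECT * FROM testj.cdc_s5_gpps_ind
WHERE id_transformace = 1293
LIMIT 100;
COMMIT;

-- Workaround 2: ORDER BY random() blocks the early-stop plan,
-- at the cost of fetching and sorting all matching rows first.
SELECT * FROM testj.cdc_s5_gpps_ind
WHERE id_transformace = 1293
ORDER BY random()
LIMIT 100;
```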

You might get it to change the plan by tweaking some of your *_cost parameters, but that would likely make other things worse. Seeing the output of EXPLAIN (ANALYZE, BUFFERS) for the LIMITed query run with enable_seqscan=off could inform that decision.
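That diagnostic run could look like this (a sketch; it shows what the index plan actually costs, for comparison against the seq scan plan in the question):

```sql
-- Force the planner away from seq scans just for this session,
-- capture the real buffer/timing numbers for the index plan,
-- then put the setting back.
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM testj.cdc_s5_gpps_ind
WHERE id_transformace = 1293
LIMIT 100;
RESET enable_seqscan;
```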

Upvotes: 8

wildplasser

Reputation: 44250

Since the column appears to be sparse/skewed, you could try increasing the statistics target:

ALTER TABLE testj.cdc_s5_gpps_ind
  ALTER COLUMN id_transformace SET STATISTICS 1000;

ANALYZE testj.cdc_s5_gpps_ind;

Postgres 10 and above also has extended statistics, allowing multi-column correlations to be recognised and exploited. You need some understanding of the actual structure of the data in the table to use them effectively.
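An extended-statistics object might be created like this (a sketch; the choice of a second correlated column is hypothetical, and as another answer notes, this captures inter-column correlation, not correlation with physical row order):

```sql
-- Sketch: a functional-dependencies statistics object.
-- Pairing id_transformace with id is an assumption for illustration;
-- pick columns that are actually correlated in your data.
CREATE STATISTICS cdc_s5_gpps_dep_stats (dependencies)
    ON id_transformace, id
    FROM testj.cdc_s5_gpps_ind;

-- Extended statistics are only populated by a subsequent ANALYZE.
ANALYZE testj.cdc_s5_gpps_ind;
```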

Upvotes: 0
