62mkv

Reputation: 1522

How to improve PostgreSQL LIKE %text% query performance

I have 3 tables: requests, steps, and payloads. Each request has N steps (one-to-many), and each step has 1 payload (one-to-one).

Now, whenever I want to filter by payload body, execution time is terrible.

Here's the simplified query:

select rh.id
from request_history_step_payload rhsp
join request_history_step rhs on rhs.id = rhsp.step_id
join request_history rh on rhs.request_id = rh.id
where rh.id > 35000 and rhs.step_type = 'CONSUMER_REQUEST' and rhsp.payload like '%09141%';

and here's its EXPLAIN ANALYZE output (run immediately after VACUUM ANALYZE):

Nested Loop  (cost=0.71..50234.28 rows=1 width=8) (actual time=120.093..2494.929 rows=12 loops=1)
  ->  Nested Loop  (cost=0.42..50233.32 rows=3 width=8) (actual time=120.083..2494.900 rows=14 loops=1)
        ->  Seq Scan on request_history_step_payload rhsp  (cost=0.00..50098.28 rows=16 width=8) (actual time=120.063..2494.800 rows=25 loops=1)
              Filter: ((payload)::text ~~ '%09141%'::text)
              Rows Removed by Filter: 164512
        ->  Index Scan using request_history_step_pkey on request_history_step rhs  (cost=0.42..8.44 rows=1 width=16) (actual time=0.003..0.003 rows=1 loops=25)
              Index Cond: (id = rhsp.step_id)
              Filter: ((step_type)::text = 'CONSUMER_REQUEST'::text)
              Rows Removed by Filter: 0
  ->  Index Only Scan using request_history_pkey on request_history rh  (cost=0.29..0.32 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=14)
        Index Cond: ((id = rhs.request_id) AND (id > 35000))
        Heap Fetches: 0
Planning Time: 0.711 ms
Execution Time: 2494.964 ms

Now, I would want to somehow suggest to the planner that it apply the LIKE operation as the LAST leg, but I can't think of any way of doing so. I've tried all kinds of joins, re-ordering them, and moving conditions from the ON clauses to the WHERE clause and vice versa. All to no avail! In every variant it looks first at ALL rows in the payload table, which is obviously the worst possible idea, given the other conditions that could drastically reduce the number of LIKE operations that need to be applied. What I'd expect is for it to apply the id condition first, which already filters out about 90% of all records; then the step_type condition, which would filter out about 85% of the rest; and thus apply the LIKE condition to less than 5% of all payloads.

How would I go about it? I am using Postgres 11.

UPD: it was suggested that I add index information for these tables, so:

UPD2: the step and payload tables also have FKs defined (payload.step_id -> step.id and step.request_id -> request.id, respectively).
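In DDL terms, roughly (constraint names here are illustrative, not the real ones):

alter table request_history_step_payload
    add constraint payload_step_id_fkey  -- illustrative name
    foreign key (step_id) references request_history_step (id);

alter table request_history_step
    add constraint step_request_id_fkey  -- illustrative name
    foreign key (request_id) references request_history (id);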

I've also tried several (simplified) queries w/sub-SELECT:

explain analyze select rhs.id from request_history_step rhs
join (select step_id from request_history_step_payload rhsp where rhsp.payload like '%09141%') rhsp on rhs.id = rhsp.step_id 
where rhs.step_type = 'CONSUMER_REQUEST';

explain analyze select rhsp.step_id from request_history_step_payload rhsp
join (select id from request_history_step rhs where rhs.step_type = 'CONSUMER_REQUEST') rhs on rhs.id = rhsp.step_id 
where rhsp.payload like '%09141%';


explain analyze select rhsp.step_id from request_history_step_payload rhsp
where rhsp.step_id in (select id from request_history_step rhs where rhs.step_type = 'CONSUMER_REQUEST')
and rhsp.payload like '%09141%';

(also with JOIN LATERAL instead of just JOIN) - each of them gives the exact same plan: a nested loop containing another nested loop, whose "outer" (first) leg is a SeqScan. Which drives me nuts. Why would it want to execute the most expensive operation against the widest possible set of rows??

UPD3: inspired by the comments under the original question, I've run some further experiments. I settled on a simpler query:

select rhs.request_id 
from request_history_step_payload rhsp
join request_history_step rhs on rhs.id = rhsp.step_id
where rhs.step_type = 'CONSUMER_REQUEST' and rhsp.payload like '%09141%';

And its execution plan was essentially the same as for the original one, just with one "Nested Loop" fewer.

Now, I've added an index on payload.step_id:

create index request_history_step_payload_step_id on request_history_step_payload(step_id);

I ran VACUUM ANALYZE, then re-ran the query with EXPLAIN ANALYZE - nothing changed. Hmmm.

Now I've run SET enable_seqscan = off. And NOW WE'RE TALKING:

Gather  (cost=1000.84..88333.90 rows=3 width=8) (actual time=530.273..589.650 rows=14 loops=1)
  Workers Planned: 1
  Workers Launched: 1
  ->  Nested Loop  (cost=0.84..87333.60 rows=2 width=8) (actual time=544.639..580.608 rows=7 loops=2)
        ->  Parallel Index Scan using request_history_step_pkey on request_history_step rhs  (cost=0.42..15913.04 rows=20867 width=16) (actual time=0.029..28.667 rows=17620 loops=2)
              Filter: ((step_type)::text = 'CONSUMER_REQUEST'::text)
              Rows Removed by Filter: 64686
        ->  Index Scan using request_history_step_payload_step_id on request_history_step_payload rhsp  (cost=0.42..3.41 rows=1 width=8) (actual time=0.031..0.031 rows=0 loops=35239)
              Index Cond: (step_id = rhs.id)
              Filter: ((payload)::text ~~ '%09141%'::text)
              Rows Removed by Filter: 1
Planning Time: 0.655 ms
Execution Time: 589.688 ms

Now, with the cost of the SeqScan going through the roof, I think we can see the gist of the problem: the estimated cost of this execution plan is higher than that of the original one (88k vs 50k), even though the actual execution time is far shorter (590ms vs 2700ms). And this apparently is the reason why the Postgres planner keeps choosing "SeqScan first" despite all my efforts to convince it otherwise.

I've also tried adding indexes on the step.step_type field, both hash- and btree-based. Each of those would still produce a plan with a cost well over 50k, so with enable_seqscan set to on (the default), the planner would always ignore them.
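For reference, they were along these lines (the hash one shows up in a plan below as request_history_step_step_type_hash; the btree index name is illustrative):

create index request_history_step_step_type_hash on request_history_step using hash (step_type);
create index request_history_step_step_type_btree on request_history_step (step_type);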

Does anyone know a mitigation for this? I am afraid the proper solution might require changing the planner's cost constants, which I am not inclined to do, of course. But I would be glad to hear any suggestions!
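(For the record, the kind of change I mean is something like the following - shown for illustration only, I have not applied it:)

-- tell the planner that random I/O is cheap (e.g. on SSDs), which lowers
-- the estimated cost of index scans relative to sequential scans
set random_page_cost = 1.1;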

UPD4: now that I've played around some more, I can report further results (with enable_seqscan set to on):

This one is slow and applies a SeqScan, even though there's an index on step.step_type:

explain analyze 
select rhsp.step_id 
from (select request_id, id from request_history_step rhs2 where rhs2.step_type = 'CONSUMER_REQUEST') rhs
join request_history_step_payload rhsp on rhs.id = rhsp.step_id
where rhsp.payload like '%09141%';

This one, based on a suggestion by O. Jones, is still slow:

explain analyze
with rhs as (select request_id, id from request_history_step rhs2 where rhs2.step_type = 'CONSUMER_REQUEST') 
select rhsp.step_id from request_history_step_payload rhsp 
join rhs on rhs.id = rhsp.step_id 
where rhsp.payload like '%09141%';

But this one, slightly modified, is fast:

explain analyze
with rhs as (select request_id, id from request_history_step rhs2 where rhs2.step_type = 'CONSUMER_REQUEST') 
select rhsp.step_id   
from request_history_step_payload rhsp 
join rhs on rhs.id = rhsp.step_id 
where rhsp.step_id in (select id from rhs) and rhsp.payload like '%09141%';

Its execution plan is:

Hash Join  (cost=9259.55..10097.04 rows=2 width=8) (actual time=1157.984..1162.199 rows=14 loops=1)
  Hash Cond: (rhs.id = rhsp.step_id)
  CTE rhs
    ->  Bitmap Heap Scan on request_history_step rhs2  (cost=1169.28..6918.06 rows=35262 width=16) (actual time=3.899..19.093 rows=35241 loops=1)
          Recheck Cond: ((step_type)::text = 'CONSUMER_REQUEST'::text)
          Heap Blocks: exact=3120
          ->  Bitmap Index Scan on request_history_step_step_type_hash  (cost=0.00..1160.46 rows=35262 width=0) (actual time=3.047..3.047 rows=35241 loops=1)
                Index Cond: ((step_type)::text = 'CONSUMER_REQUEST'::text)
  ->  CTE Scan on rhs  (cost=0.00..705.24 rows=35262 width=8) (actual time=3.903..5.976 rows=35241 loops=1)
  ->  Hash  (cost=2341.39..2341.39 rows=8 width=16) (actual time=1153.976..1153.976 rows=14 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Nested Loop  (cost=793.81..2341.39 rows=8 width=16) (actual time=104.170..1153.919 rows=14 loops=1)
              ->  HashAggregate  (cost=793.39..795.39 rows=200 width=8) (actual time=33.315..44.875 rows=35241 loops=1)
                    Group Key: rhs_1.id
                    ->  CTE Scan on rhs rhs_1  (cost=0.00..705.24 rows=35262 width=8) (actual time=0.001..23.590 rows=35241 loops=1)
              ->  Index Scan using request_history_step_payload_step_id on request_history_step_payload rhsp  (cost=0.42..7.72 rows=1 width=8) (actual time=0.031..0.031 rows=0 loops=35241)
                    Index Cond: (step_id = rhs_1.id)
                    Filter: ((payload)::text ~~ '%09141%'::text)
                    Rows Removed by Filter: 1
Planning Time: 1.318 ms
Execution Time: 1162.618 ms

Again, the costs are down dramatically, while the execution time is not down nearly as much.

Upvotes: 3

Views: 677

Answers (1)

O. Jones

Reputation: 108641

From your plan it looks like the LIKE predicate on your request_history_step_payload scan is what hurts: no index helps it, so it forces a sequential scan of the whole table.

So let's try a text (trigram) index to help that search step work faster. It requires the pg_trgm extension.

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX CONCURRENTLY rhsp_payload_trgm
    ON request_history_step_payload
 USING GIN (payload gin_trgm_ops);

It's possible it will help. Give it a try.

When you have that index you might also try reworking your query like this:

select rh.id
from (  select step_id
          from request_history_step_payload
         where payload like '%09141%'
     ) rhsp
join request_history_step rhs on rhs.id = rhsp.step_id
join request_history rh on rhs.request_id = rh.id
where rh.id > 35000
  and rhs.step_type = 'CONSUMER_REQUEST';

This moves the payload search of that request_history_step_payload table into a subquery (the step_type filter belongs to request_history_step, so it stays in the outer WHERE clause). You can work on optimizing the subquery separately as you try to get this all working fast enough for your application.
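For example, you can time the payload search on its own:

explain analyze
select step_id
from request_history_step_payload
where payload like '%09141%';

With the trigram index in place, this should show a bitmap index scan instead of a sequential scan.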

And, drop any duplicate indexes. They slow down INSERT and UPDATE operations for no good reason.
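If you're not sure whether you have duplicates, a rough catalog query along these lines can surface candidates (a sketch: it compares only the indexed column lists, ignoring opclasses and partial-index predicates, so verify before dropping anything):

select indrelid::regclass as table_name,
       array_agg(indexrelid::regclass) as candidate_duplicates
from pg_index
group by indrelid, indkey::text
having count(*) > 1;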

Upvotes: 1
