Naor Tedgi
Naor Tedgi

Reputation: 5699

How to Optimize query of substring searching in Postgres?

I have 2 tables both as the same form

table A : content | text

table B : SubContent | text

Table B containing substring Of Texts in Table A

for exampale:

Table a :

Table B :

what i want to achive is :

making south africa | zupta . . for **making south africa** worse than ever.
terkenang arwah ibu | zur eda **terkenang arwah ibu** . . eda terkenang arwah ibu . .

the query I used is very simple

select subcontent, content from A join B on  content SIMILAR TO '%'||subcontent||'%'

the query explain :

Gather  (cost=1000.00..6611032.19 rows=1678309 width=180)
  Workers Planned: 2
  ->  Nested Loop  (cost=0.00..6442201.29 rows=699295 width=180)
"        Join Filter: (A.content ~ similar_escape((('%'::text || B.subcontent) || '%'::text), NULL::text))"
        ->  Parallel Seq Scan on A  (cost=0.00..8675.79 rows=186479 width=91)
        ->  Seq Scan on B  (cost=0.00..19.50 rows=750 width=89)

i try use Large Text Indexes follow by this blog (Andrew Kane done a great job! )

https://medium.com/@ankane/large-text-indexes-in-postgres-5d7f1677f89f

i try to add an index on both of the columns that didn't help the execution takes 40 minutes, any ideas?

explain after using Laurenz Albe answer

Nested Loop  (cost=27.79..100352.93 rows=1678309 width=180)
  ->  Seq Scan on q2_distinct  (cost=0.00..19.50 rows=750 width=89)
  ->  Bitmap Heap Scan on clean_distinct  (cost=27.79..111.40 rows=2238 width=91)
"        Recheck Cond: (A.content ~ similar_escape((('%'::text || q2_distinct.part) || '%'::text), NULL::text))"
        ->  Bitmap Index Scan on "clean_distinct_Post content_idx"  (cost=0.00..27.23 rows=2238 width=0)
"              Index Cond: (A.content ~ similar_escape((('%'::text || B.content) || '%'::text), NULL::text))"

Upvotes: 1

Views: 1320

Answers (2)

Dunes
Dunes

Reputation: 40683

Postgres' in-built textsearch functions could help here. But you will need to add a second column to table A to help index it. You will also need to know the language of each string in table A.

Example query.

postgres=# select to_tsvector('english', 'Big cats eat fish on Mondays.') 
               @@ phraseto_tsquery('english', 'eat fish on monday') as query;
 query
-------
 t

Note that the pluralisation and capitalisation of Mondays did not stop the match.

Use the to_tsvector function to create a parsed form of your content, which you can use for a GIN index.

postgres=# select to_tsvector('english', 'Big cats eat fish on Mondays.');
               to_tsvector
---------------------------------------------
 'big':1 'cat':2 'eat':3 'fish':4 'monday':6

Note how to_tsvector normalises the string (changing to lowercase, removing punctuation, removing plurals, and removing low value words like "on"). This means "eat fish if monday" would also match (since "if" is also a low value word). So you will need to add a second check using the LIKE operator if you only want an exact match. But this will run far less frequently than it would have without the tsvector GIN index.

So your query would now look like,

SELECT subcontent, content 
FROM A 
JOIN B ON (
    A.tsv_content @@ phraseto_tsquery(B.lang, B.subcontent) 
    AND content LIKE '%' || subcontent || '%'
);

NB. phraseto_tsquery requires 9.6+

Upvotes: 1

Laurenz Albe
Laurenz Albe

Reputation: 246143

Any join that does not have a join condition with an equality operator (=) can only use a nested loop join.

The only index that can potentially help here is a trigram index on A:

CREATE EXTENSION pg_trgm;
CREATE INDEX ON "A" USING gin (content gin_trgm_ops);

But you shouldn't expect to get good execution times with a query like that.

Upvotes: 1

Related Questions