Reputation: 5699
I have 2 tables both as the same form
table A : content | text
table B : SubContent | text
Table B containing substring Of Texts in Table A
for exampale:
Table a :
Table B :
what i want to achive is :
making south africa | zupta . . for **making south africa** worse than ever.
terkenang arwah ibu | zur eda **terkenang arwah ibu** . . eda terkenang arwah ibu . .
the query I used is very simple
select subcontent, content from A join B on content SIMILAR TO '%'||subcontent||'%'
the query explain :
Gather (cost=1000.00..6611032.19 rows=1678309 width=180)
Workers Planned: 2
-> Nested Loop (cost=0.00..6442201.29 rows=699295 width=180)
" Join Filter: (A.content ~ similar_escape((('%'::text || B.subcontent) || '%'::text), NULL::text))"
-> Parallel Seq Scan on A (cost=0.00..8675.79 rows=186479 width=91)
-> Seq Scan on B (cost=0.00..19.50 rows=750 width=89)
i try use Large Text Indexes follow by this blog (Andrew Kane done a great job! )
https://medium.com/@ankane/large-text-indexes-in-postgres-5d7f1677f89f
i try to add an index on both of the columns that didn't help the execution takes 40 minutes, any ideas?
explain after using Laurenz Albe answer
Nested Loop (cost=27.79..100352.93 rows=1678309 width=180)
-> Seq Scan on q2_distinct (cost=0.00..19.50 rows=750 width=89)
-> Bitmap Heap Scan on clean_distinct (cost=27.79..111.40 rows=2238 width=91)
" Recheck Cond: (A.content ~ similar_escape((('%'::text || q2_distinct.part) || '%'::text), NULL::text))"
-> Bitmap Index Scan on "clean_distinct_Post content_idx" (cost=0.00..27.23 rows=2238 width=0)
" Index Cond: (A.content ~ similar_escape((('%'::text || B.content) || '%'::text), NULL::text))"
Upvotes: 1
Views: 1320
Reputation: 40683
Postgres' in-built textsearch functions could help here. But you will need to add a second column to table A to help index it. You will also need to know the language of each string in table A.
Example query.
postgres=# select to_tsvector('english', 'Big cats eat fish on Mondays.')
@@ phraseto_tsquery('english', 'eat fish on monday') as query;
query
-------
t
Note that the pluralisation and capitalisation of Mondays
did not stop the match.
Use the to_tsvector
function to create a parsed form of your content, which you can use for a GIN index.
postgres=# select to_tsvector('english', 'Big cats eat fish on Mondays.');
to_tsvector
---------------------------------------------
'big':1 'cat':2 'eat':3 'fish':4 'monday':6
Note how to_tsvector
normalises the string (changing to lowercase, removing punctuation, removing plurals, and removing low value words like "on"). This means "eat fish if monday" would also match (since "if" is also a low value word). So you will need to add a second check using the LIKE
operator if you only want an exact match. But this will run far less frequently than it would have without the tsvector
GIN index.
So your query would now look like,
SELECT subcontent, content
FROM A
JOIN B ON (
A.tsv_content @@ phraseto_tsquery(B.lang, B.subcontent)
AND content LIKE '%' || subcontent || '%'
);
NB. phraseto_tsquery
requires 9.6+
Upvotes: 1
Reputation: 246143
Any join that does not have a join condition with an equality operator (=
) can only use a nested loop join.
The only index that can potentially help here is a trigram index on A
:
CREATE EXTENSION pg_trgm;
CREATE INDEX ON "A" USING gin (content gin_trgm_ops);
But you shouldn't expect to get good execution times with a query like that.
Upvotes: 1