Reputation: 11
I am using trigram similarity in Postgres to help me search flexibly for names in a DB and (more importantly) extract names from natural language sentences and match them to DB records.
I got the first task to work reliably and WICKED fast using this query:
SELECT *, similarity(name_column, 'name im searching for') AS sim
FROM table_with_names
WHERE name_column % 'name im searching for'
ORDER BY sim DESC
LIMIT 5;
The above query utilizes an index on the name_column
made with this statement:
CREATE INDEX name_sim_idx ON table_with_names USING GIN (name_column gin_trgm_ops);
My other task (extracting name matches flexibly from full sentences) is frustrating me. It seems that the pg_trgm module which contains the similarity()
function also has a word_similarity()
function which does EXACTLY what I need to be able to do. In fact, I accomplish the task with this query:
SELECT *, word_similarity(name_column, 'sentence including the name im searching for') AS sim
FROM table_with_names
WHERE name_column <% 'sentence including the name im searching for'
ORDER BY sim DESC
LIMIT 5;
HOWEVER. While the first query I listed (similarity lookup, not extraction) is super fast (1ms) this second query is painfully slow at ~350ms and will not use my index.
I do not understand why the second query will not use my trgm index. I have tried suppressing Seq Scan using SET enable_seqscan = off;
, but this dos not work. From what I can tell, Postgres's docs claim that this <%
operator is the correct operator to use if you want to use your index to accelerate word_similarity()
queries, but all of their examples use it in the opposite direction.
For example, the docs show:
WHERE 'search text' <% column
whereas I need to do the opposite:
WHERE column <% 'search text
Is what I want to do possible with trgm similarity? Or am I spinning my wheels here. I can't imagine why my index wouldn't be able to be used here. It makes 0 sense to me. Hoping someone can help me clear this up! Thanks in advance.
EDIT: Here is the execution plan as suggested by a_horse_with_no_name
Limit (cost=10000000538.89..10000000538.90 rows=5 width=114) (actual time=349.292..349.295 rows=0 loops=1)
Buffers: shared hit=407
-> Sort (cost=10000000538.89..10000000538.91 rows=11 width=114) (actual time=349.290..349.292 rows=0 loops=1)
Sort Key: (word_similarity(full_name, 'this is the sentence that contains the name i am trying to extract'::text)) DESC, period_start DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=407
-> Seq Scan on patient_matching_lookup_v1 (cost=10000000000.00..10000000538.70 rows=11 width=114) (actual time=349.280..349.281 rows=0 loops=1)
Filter: (full_name <% 'this is the sentence that contains the name i am trying to extract'::text)
Rows Removed by Filter: 10534
Buffers: shared hit=407
Planning Time: 0.172 ms
Execution Time: 349.335 ms
Upvotes: 1
Views: 1326
Reputation: 246443
A trigram index will support a condition 'constant' <% indexed_col
, but not indexed_col <% 'constant'
.
Upvotes: 1