Reputation: 10701
I would like to know what kind of similarity function is used in the case of PostgreSQL pg_trgm
extension. My initial assumption was that it computes the similarity of two strings s1 and s2 using the following formula:
sim(s1, s2) = |G3(s1) ⋂ G3(s2)| / max(|G(s1)|, |G(s2)|)
where G3 is a set of 3-grams for a string. I tried several examples and it seems that the computation is somehow different in PostgreSQL.
create extension pg_trgm;
create table doc (
word text
);
insert into doc values ('bbcbb');
select *, similarity(word, 'bcb') from doc;
The above example returns 0.25. However,
G3('bbcbb') = {##b, #bb, bbc, bcb, cbb, bb#, b##}
G3('bcb') = {##b, #bc, bcb, cb#, b##}
|G3(s1) ⋂ G3(s2)| = 3
max(|G(s1)|, |G(s2)|) = 7
therefore the sim
formula does not return 0.25. What is the correct formula?
Upvotes: 2
Views: 952
Reputation: 131
I think u need to check this: PostgreSQL, trigrams and similarity
Summ:
SELECT show_trgm('bbcbb'), show_trgm('bcb')
'bbcbb' = {" b"," bb","bb ",bbc,bcb,cbb}
'bcb' = {" b"," bc",bcb,"cb "}
'bbcbb' -> 6 element 'bcb' -> 4 element
SELECT UNNEST(show_trgm('bbcbb')) INTERSECT SELECT UNNEST(show_trgm('bcb'))
Result: 2 elements are common
SELECT UNNEST(show_trgm('bbcbb')) UNION SELECT UNNEST(show_trgm('bcb'))
Result: 8 elements
Similarity 2/8 = 0.25
UPDATE: https://www.postgresql.org/docs/13/pgtrgm.html
In the F.31.1. Trigram (or Trigraph) Concepts section
There is a note says:
"...Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string..."
Upvotes: 3