Radim Bača
Radim Bača

Reputation: 10701

Computation of pg_trgm similarity

I would like to know what kind of similarity function is used in the case of PostgreSQL pg_trgm extension. My initial assumption was that it computes the similarity of two strings s1 and s2 using the following formula:

sim(s1, s2) = |G3(s1) ⋂ G3(s2)| / max(|G(s1)|, |G(s2)|)

where G3 is a set of 3-grams for a string. I tried several examples and it seems that the computation is somehow different in PostgreSQL.

create extension pg_trgm;

create table doc (
    word text
);

insert into doc values ('bbcbb');

select *, similarity(word, 'bcb') from doc;

The above example returns 0.25. However,

G3('bbcbb') = {##b, #bb, bbc, bcb, cbb, bb#, b##}
G3('bcb') = {##b, #bc, bcb, cb#, b##}
|G3(s1) ⋂ G3(s2)| = 3
max(|G(s1)|, |G(s2)|) = 7

therefore the sim formula does not return 0.25. What is the correct formula?

Upvotes: 2

Views: 952

Answers (1)

I think u need to check this: PostgreSQL, trigrams and similarity

Summ:

SELECT show_trgm('bbcbb'), show_trgm('bcb')

'bbcbb' = {" b"," bb","bb ",bbc,bcb,cbb}
'bcb' = {" b"," bc",bcb,"cb "}

'bbcbb' -> 6 element 'bcb' -> 4 element

SELECT UNNEST(show_trgm('bbcbb')) INTERSECT SELECT UNNEST(show_trgm('bcb'))

Result: 2 elements are common

SELECT UNNEST(show_trgm('bbcbb')) UNION SELECT UNNEST(show_trgm('bcb'))

Result: 8 elements

Similarity 2/8 = 0.25

UPDATE: https://www.postgresql.org/docs/13/pgtrgm.html

In the F.31.1. Trigram (or Trigraph) Concepts section

There is a note says:

"...Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string..."

Upvotes: 3

Related Questions