Computation of pg_trgm similarity

Question

I would like to know which similarity function the PostgreSQL pg_trgm extension uses. My initial assumption was that it computes the similarity of two strings s1 and s2 with the following formula:

sim(s1, s2) = |G3(s1) ⋂ G3(s2)| / max(|G3(s1)|, |G3(s2)|)

where G3(s) is the set of 3-grams of the string s, padded with # at both ends. I tried several examples, and PostgreSQL seems to compute the value differently.

create extension pg_trgm;

create table doc (
    word text
);

insert into doc values ('bbcbb');

select *, similarity(word, 'bcb') from doc;

The above example returns 0.25. However,

G3('bbcbb') = {##b, #bb, bbc, bcb, cbb, bb#, b##}
G3('bcb') = {##b, #bc, bcb, cb#, b##}
|G3(s1) ⋂ G3(s2)| = 3
max(|G3(s1)|, |G3(s2)|) = 7

so the sim formula gives 3/7 ≈ 0.43, not 0.25. What formula does pg_trgm actually use?
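
For reference, the hand computation can be reproduced with a short Python sketch. The function names (trigrams, sim_max, sim_union) are made up for illustration and are not part of pg_trgm. sim_max implements the formula assumed in this question. sim_union is only a hypothesis to test against: it pads with two leading blanks and one trailing blank, which is how the pg_trgm documentation describes trigram extraction, and divides by the size of the union instead of the maximum. The sketch ignores lowercasing and word splitting, which do not matter for these inputs.

```python
def trigrams(s, left, right):
    """Return the set of distinct 3-grams of s after padding it on both sides."""
    padded = left + s + right
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def sim_max(a, b):
    """Formula assumed in the question: |A & B| / max(|A|, |B|), '##' padding on both sides."""
    ga = trigrams(a, "##", "##")
    gb = trigrams(b, "##", "##")
    return len(ga & gb) / max(len(ga), len(gb))


def sim_union(a, b):
    """Hypothetical alternative: |A & B| / |A | B|, two leading blanks and one trailing blank."""
    ga = trigrams(a, "  ", " ")
    gb = trigrams(b, "  ", " ")
    return len(ga & gb) / len(ga | gb)


print(sim_max("bbcbb", "bcb"))    # 3/7, the value computed by hand in the question
print(sim_union("bbcbb", "bcb"))  # 2/8 = 0.25
```

The second variant matches the 0.25 returned by similarity() for this pair, but a single data point does not confirm the formula. Comparing the sets against the output of show_trgm(word) on more examples would be a more reliable check.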
