Why is pg_search trigram returning these unexpected results?

Question

I'm using the pg_search_scope feature of the Ruby pg_search gem, and I have a Listing model with the following scope:

pg_search_scope(
    :search,
    against: {
      title: "A",
      description: "B",
    },
    using: {
      tsearch: {
        tsvector_column: "title_description_tsvector",
        dictionary: "english",
        any_word: true,
      },
      trigram: {
        word_similarity: true,
        threshold: 0.4,
      },
    },
    ignoring: :accents,
  )

Using FactoryBot, I've created two instances of Listing in my test suite:

create(:listing, title: "Pink Cras")

create(:listing, description: "For sale, pink cras")

Given this setup, I'd expect there to be the following trigrams if I search for the term "cradle":

Cradle => c, cr, cra, rad, adl, dle, le, e

Cras => c, cr, cra, ras, as, s

All trigrams => C, cr, cra, rad, adl, dle, le, e, ras, as, s

Trigrams in common => C, cr, cra

There are 3 trigrams in common out of a total of 11, and 3 / 11 = .272727. Since my threshold is set to .4, I'd expect both of these instances to be filtered out of my results. However both of these instances are in fact included in the results.

Am I missing something about how trigrams are defined? Is my count of all trigrams (or common trigrams) incorrect?

Here is the SQL query that gets generated from my ActiveRecord query:

SELECT "listings".* FROM "listings" INNER JOIN (SELECT "listings"."id" AS pg_search_id, (ts_rank(("listings"."title_description_tsvector"), (to_tsquery('english', ''' ' || unaccent('cradle') || ' ''')), 0)) AS rank FROM "listings" WHERE (("listings"."title_description_tsvector") @@ (to_tsquery('english', ''' ' || unaccent('cradle') || ' '''))) OR (word_similarity(unaccent('cradle'), (unaccent(coalesce("listings"."title"::text, '') || ' ' || coalesce("listings"."description"::text, '')))) >= 0.4)) AS pg_search_8a836f245cd6a84ba9cbd1 ON "listings"."id" = pg_search_8a836f245cd6a84ba9cbd1.pg_search_id ORDER BY pg_search_8a836f245cd6a84ba9cbd1.rank DESC, "listings"."id" ASC LIMIT $1

jjanes · Accepted Answer

Cradle => c, cr, cra, rad, adl, dle, le, e

pg_trgm does not ever generate the double-space padded tigram at the end of the word, so the last trigram here is le, not e. This weights the beginning of words more than the ending, which is generally a reasonable thing to do.

Cras => c, cr, cra, ras, as, s

Similarly, there is no s trigram. Also, word_similarity, counter-intuitively, does not force the substring of the right hand argument to align on word boundaries. So it treats 'cras' as if it were 'cra', since that is the higher-scoring substring, so there is no ras or as. And since 'cra' is not aligned to a word boundary on the right hand side, it apparently does not make an ra either.

Maybe you want strict_word_similarity instead.

Why is pg_search trigram returning these unexpected results?

Answers (1)

Related Questions