Supporting typos in Postgres FTS

Question

This returns true

SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')  
    @@ to_tsquery('j-80');

These return false:

-- no minus char
SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')  
    @@ to_tsquery('j80');

-- a typo, typing 9 instead of 8
SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')  
    @@ to_tsquery('j90');

-- the user searches with a space 'j 80'
SELECT to_tsvector('The quick brown fox jumped over the lazy j-80 dog')  
    @@ to_tsquery('j & 80');

How do i improve the queries or maybe the tsvector so that i get true for all of the above ?

jjanes · Accepted Answer

It is hard to operate effectively on an unannotated mixture of ordinary English and technical jargon, like part numbers. Adding in the shortness of the part numbers, the inconsistent punctuation (particularly if the part number can have embedded spaces), and the possibilities of misspellings, and it all adds up to a very hard problem. If you can somehow extract the part numbers into their own column and standardize the punctuation both in that column and in the query (by removing all punctuation, for example), then you can use a pg_trgm index or operators. But with the part number being only 3 characters longs, you still don't have much to go with. For example, j80 and j90 are just barely related at all in the trigram algorithm:

create extension if not exists pg_trgm;
select similarity('j80', 'j90');
 similarity 
------------
   0.142857

Basically, they both start with j is all you have there. (They also both end with 0, but trigrams need at least 2 characters at the end of a word to be the same to consider it a match--beginnings have more weight than endings).

Supporting typos in Postgres FTS

Answers (1)

Related Questions