tim_xyz
tim_xyz

Reputation: 13491

Fuzzy match a substring within a larger string in Postgres

Is it possible to fuzzy match a substring within a larger string in Postgres?

Example:

For a search of colour (ou), return all records where the string includes color, colors or colour.

select 
  * 
from things
where fuzzy(color) in description;

id | description
----------------
1  | A red coloured car
2  | The garden
3  | Painting colors

=> return records 1 and 3

I was wondering if it's possible to combine both fuzzystrmatch and tsvector so that the fuzzy matching could be applied to each vectorized term?

Or if there is another approach?

Upvotes: 1

Views: 1236

Answers (1)

jjanes
jjanes

Reputation: 44167

You can do it of course, but I doubt it will be very useful:

select *,levenshtein(lexeme,'color')  from things, unnest(to_tsvector('english',description))
   order by levenshtein;

 id |    description     | lexeme | positions | weights | levenshtein 
----+--------------------+--------+-----------+---------+-------------
  3 | Painting colors    | color  | {2}       | {D}     |           0
  1 | A red coloured car | colour | {3}       | {D}     |           1
  1 | A red coloured car | car    | {4}       | {D}     |           3
  1 | A red coloured car | red    | {2}       | {D}     |           5
  3 | Painting colors    | paint  | {1}       | {D}     |           5
  2 | The garden         | garden | {2}       | {D}     |           6

Presumably you would want to embellish the query to apply some cutoff, probably where the cutoff depends on the lengths, and return only the best result for each description assuming it met that cutoff. Doing that should be just routine SQL manipulations.

Perhaps better would be the word similarity operators recently added to pg_trgm.

select *, description <->> 'color' as distance from things order by description <->> 'color';

 id |    description     | distance 
----+--------------------+----------
  3 | Painting colors    | 0.166667
  1 | A red coloured car | 0.333333
  2 | The garden         |        1

Another option would be to find a stemmer or thesaurus which standardizes British/American spellings (I am not aware of one readily available), and then not use fuzzy matching at all. I think this would be best, if you can do it.

Upvotes: 2

Related Questions