Reputation: 13491
Is it possible to fuzzy match a substring within a larger string in Postgres?
Example:
For a search of colour
(ou), return all records where the string includes color
, colors
or colour
.
select
*
from things
where fuzzy(color) in description;
id | description
----------------
1 | A red coloured car
2 | The garden
3 | Painting colors
=> return records 1 and 3
I was wondering if it's possible to combine both fuzzystrmatch
and tsvector
so that the fuzzy matching could be applied to each vectorized term?
Or if there is another approach?
Upvotes: 1
Views: 1236
Reputation: 44167
You can do it of course, but I doubt it will be very useful:
select *,levenshtein(lexeme,'color') from things, unnest(to_tsvector('english',description))
order by levenshtein;
id | description | lexeme | positions | weights | levenshtein
----+--------------------+--------+-----------+---------+-------------
3 | Painting colors | color | {2} | {D} | 0
1 | A red coloured car | colour | {3} | {D} | 1
1 | A red coloured car | car | {4} | {D} | 3
1 | A red coloured car | red | {2} | {D} | 5
3 | Painting colors | paint | {1} | {D} | 5
2 | The garden | garden | {2} | {D} | 6
Presumably you would want to embellish the query to apply some cutoff, probably where the cutoff depends on the lengths, and return only the best result for each description assuming it met that cutoff. Doing that should be just routine SQL manipulations.
Perhaps better would be the word similarity operators recently added to pg_trgm
.
select *, description <->> 'color' as distance from things order by description <->> 'color';
id | description | distance
----+--------------------+----------
3 | Painting colors | 0.166667
1 | A red coloured car | 0.333333
2 | The garden | 1
Another option would be to find a stemmer or thesaurus which standardizes British/American spellings (I am not aware of one readily available), and then not use fuzzy matching at all. I think this would be best, if you can do it.
Upvotes: 2