Reputation: 34909
So I've got a column in a table that contains a string values (keywords populated from a 3rd party tool). I'm working on an automated tool to identify clusters of similar values that could probably be normalized to a single value. For example, "Firemen"/"Fireman", "Isotope"/"Asotope" or "Canine"/"Canines".
An approach that calculates the levenshtein distance seems ideal except for the fact that it involves too much string manipulation/comparison and would probably make poor use of SQL indexes.
I've considered incrementally grouping by the Left(X) characters of the column, which is a not-so-bad way to maximize index use, but this approach is really only effective at finding words with differences at the very end of the word.
Anyone got some good ideas for solving this problem efficiently in SQL?
Note: I realize this question is very similar to (Finding how similar two strings are), but the distinction here is the need to do this efficiently in SQL.
Upvotes: 0
Views: 657
Reputation: 11079
If you are using SQL Server, you might look into using the SOUNDEX() function as in:
...
where
SOUNDEX("searchterm") = SOUNDEX(searchvaluefield)
it is supposed to do Phonetic matching on the strings ...
Some odd examples ... so it seems you could catch plurals by always appending the plural text to both sides, since multiple 's's sound the same ... :-)
select soundex('Canine'), soundex('Canines')
go
----- -----
C550 C552
1 Row(s) affected
select soundex('Canine'), soundex('Caynyn')
go
----- -----
C550 C550
1 Row(s) affected
select soundex('Canines'), soundex('Caniness')
go
----- -----
C552 C552
1 Row(s) affected
Upvotes: 1
Reputation: 21695
John, if you are using MS SQL Server, you can take advantage of the Full-Text Indexing service. Full-text search functionality has some powerful functions using which you can achieve this.
Upvotes: 0
Reputation: 45117
You don't mention what DB your using, but if it's T-SQL, you could use the SOUNDEX value and difference.
Upvotes: 2