Reputation: 34909

Performant techniques for finding similar values in SQL?

So I've got a column in a table that contains a string values (keywords populated from a 3rd party tool). I'm working on an automated tool to identify clusters of similar values that could probably be normalized to a single value. For example, "Firemen"/"Fireman", "Isotope"/"Asotope" or "Canine"/"Canines".

An approach that calculates the levenshtein distance seems ideal except for the fact that it involves too much string manipulation/comparison and would probably make poor use of SQL indexes.

I've considered incrementally grouping by the Left(X) characters of the column, which is a not-so-bad way to maximize index use, but this approach is really only effective at finding words with differences at the very end of the word.

Anyone got some good ideas for solving this problem efficiently in SQL?

Note: I realize this question is very similar to (Finding how similar two strings are), but the distinction here is the need to do this efficiently in SQL.

Upvotes: 0

Answers (3)

Ron Savage

Reputation: 11079

If you are using SQL Server, you might look into using the SOUNDEX() function as in:

...
where
   SOUNDEX("searchterm") = SOUNDEX(searchvaluefield)

it is supposed to do Phonetic matching on the strings ...

Some odd examples ... so it seems you could catch plurals by always appending the plural text to both sides, since multiple 's's sound the same ... :-)

select soundex('Canine'), soundex('Canines')
go

----- ----- 
C550  C552  

1 Row(s) affected


select soundex('Canine'), soundex('Caynyn')
go

----- ----- 
C550  C550  

1 Row(s) affected


select soundex('Canines'), soundex('Caniness')
go

----- ----- 
C552  C552  

1 Row(s) affected

Upvotes: 1

Kirtan

Reputation: 21695

John, if you are using MS SQL Server, you can take advantage of the Full-Text Indexing service. Full-text search functionality has some powerful functions using which you can achieve this.

Upvotes: 0

JP Alioto

Reputation: 45117

You don't mention what DB your using, but if it's T-SQL, you could use the SOUNDEX value and difference.

Upvotes: 2

Performant techniques for finding similar values in SQL?

Answers (3)

Related Questions