Reputation: 14780
Is it possible to search for similar words in SQL Server 2008 ?
If the user types: Ayrton Sena
with a single 'n'
it should also return rows with Ayrton Senna
with two 'nn'
I think the same method applies for spell checking words
Upvotes: 2
Views: 3310
Reputation: 988
I've been investigating a similar problem and have stumbled across a 'metaphone' algorithm that was created in 1990. It's effectively a more accurate version of Soundex and is useful for identifying phonetically similar words. It appears as an inbuilt function in some programming languages. Here's a SQL Server equivalent function by 'Phil Factor' that we've been using with some good success. It was reverse engineered from php's inbuilt metaphone function.
I've pasted a reformatted version below that I did just so I could read the code a bit easier.
IF OBJECT_ID('Utils.Metaphone','FN') IS NOT NULL --drop any existing metaphone function
DROP FUNCTION Utils.Metaphone
GO
CREATE FUNCTION Utils.Metaphone
(
@String VARCHAR(30)
)
RETURNS VARCHAR(10)
AS
BEGIN
DECLARE @New BIT
,@ii INT
,@Metaphone VARCHAR(28)
,@Len INT
,@Where INT;
DECLARE @This CHAR
,@Next CHAR
,@Following CHAR
,@Previous CHAR
,@Silent BIT;
SELECT @String = UPPER(LTRIM(COALESCE(@String, ''))); --trim and upper case
SELECT @Where = PATINDEX ('%[^A-Z]%', @String COLLATE Latin1_General_CI_AI )
WHILE @Where > 0 --strip out all non-alphabetic characters!
BEGIN
SELECT @String = STUFF(@string, @Where, 1, '')
SELECT @Where = PATINDEX ('%[^A-Z]%',@String COLLATE Latin1_General_CI_AI )
END
IF (LEN(@String) < 2) RETURN @String
--do the start of string stuff first.
--If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
-- "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"
IF SUBSTRING(@String, 1, 2) IN ( 'KN', 'GN', 'PN', 'AE', 'WR' )
SELECT @String = STUFF(@String, 1, 1, '');
-- Beginning of word: "x" change to "s" as in "Deng Xiaopeng"
IF SUBSTRING(@String, 1, 1) = 'X'
SELECT @String = STUFF(@String, 1, 1, 'S');
-- Beginning of word: "wh-" change to "w" as in "Whatsoever"
IF @String LIKE 'WH%'
SELECT @String = STUFF(@String, 1, 1, 'W');
-- Set up for While loop
SELECT @Len = LEN(@String), @Metaphone = '' -- Initialize the main variable
,@New = 1 -- this variable only used next 10 lines!!!
,@ii = 1; --Position counter
--
WHILE((LEN(@Metaphone) <= 8) AND (@ii <= @Len))
BEGIN --SET up the 'pointers' for this loop-around }
SELECT @Previous = CASE WHEN @ii > 1 THEN SUBSTRING(@String, @ii - 1, 1) ELSE '' END
-- originally a nul terminated string }
,@This = SUBSTRING(@String, @ii, 1)
,@Next = CASE WHEN @ii < @Len THEN SUBSTRING(@String, @ii + 1, 1) ELSE '' END
,@Following = CASE WHEN((@ii + 1) < @Len) THEN SUBSTRING(@String, @ii + 2, 1) ELSE '' END
-- 'CC' inside word
/* Drop duplicate adjacent letters, except for C.*/
IF @This=@Previous AND @This<> 'C'
BEGIN
--we do nothing
SELECT @New=0
END
/*Drop all vowels unless it is the beginning.*/
ELSE IF @This IN ( 'A', 'E', 'I', 'O', 'U' )
BEGIN
IF @ii = 1 --vowel at the beginning
SELECT @Metaphone = @This;
/* B -> B unless at the end of word after "m", as in "dumb", "Comb" */
END;
ELSE IF @This = 'B' AND NOT ((@ii = @Len) AND (@Previous = 'M'))
BEGIN
SELECT @Metaphone = @Metaphone + 'B';
END;
-- -mb is silent
/*'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-',
in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'.
Otherwise, 'C' transforms to 'K'.*/
ELSE IF @This = 'C'
BEGIN -- -sce, i, y = silent
IF NOT (@Previous= 'S') AND (@Next IN ( 'H', 'E', 'I', 'Y' )) --front vowel set
BEGIN
IF (@Next = 'I') AND (@Following = 'A')
SELECT @Metaphone = @Metaphone + 'X'; -- -cia-
ELSE IF(@Next IN ( 'E', 'I', 'Y' ))
SELECT @Metaphone = @Metaphone + 'S'; -- -ce, i, y = 'S' }
ELSE IF(@Next = 'H') AND (@Previous = 'S')
SELECT @Metaphone = @Metaphone + 'K'; -- -sch- = 'K' }
ELSE IF(@Next = 'H')
BEGIN
IF(@ii = 1) AND ((@ii + 2) <= @Len) AND NOT(@Following IN ( 'A', 'E', 'I', 'O', 'U' ))
SELECT @Metaphone = @Metaphone + 'K';
ELSE
SELECT @Metaphone = @Metaphone + 'X';
END
END
ELSE
SELECT @Metaphone = @Metaphone +CASE WHEN @Previous= 'S' THEN '' ELSE 'K' END;
-- Else silent
END; -- Case C }
/*'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D'
transforms to 'T'.*/
ELSE IF @This = 'D'
BEGIN
SELECT @Metaphone = @Metaphone
+ CASE WHEN(@Next = 'G') AND (@Following IN ( 'E', 'I', 'Y' )) --front vowel set
THEN 'J'
ELSE 'T'
END;
END;
ELSE IF @This = 'G'
/*Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G'
if followed by 'N' or 'NED' and is at the end.
'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'.
Otherwise, 'G' transforms to 'K'.*/
BEGIN
SELECT @Silent = CASE WHEN (@Next = 'H')
AND (@Following IN ('A','E','I','O','U'))
AND (@ii > 1)
AND (
((@ii+1) = @Len)
OR
(
(@Next = 'n')
AND (@Following = 'E')
AND SUBSTRING(@String,@ii+3,1) = 'D'
)
AND ((@ii+3) = @Len)
)
-- Terminal -gned
AND (@Previous = 'i')
AND (@Next = 'n')
THEN 1
-- if not start and near -end or -gned.)
WHEN (@ii > 1)
AND (@Previous = 'D')-- gnuw
AND (@Next IN ('E','I','Y')) --front vowel set
THEN 1 -- -dge, i, y
ELSE 0
END
IF NOT(@Silent=1)
SELECT @Metaphone = @Metaphone
+ CASE WHEN (@Next IN ('E','I','Y')) --front vowel set
THEN 'J'
ELSE 'K'
END
END
/*Drop 'H' if after vowel and not before a vowel.
or the second char of "-ch-", "-sh-", "-ph-", "-th-", "-gh-"*/
ELSE IF @This = 'H'
BEGIN
IF NOT( (@ii= @Len) OR (@Previous IN ( 'C', 'S', 'T', 'G' )))
AND (@Next IN ( 'A', 'E', 'I', 'O', 'U' ) )
SELECT @Metaphone = @Metaphone + 'H';
-- else silent (vowel follows) }
END;
ELSE IF @This IN --some get no substitution
( 'F', 'J', 'L', 'M', 'N', 'R' )
BEGIN
SELECT @Metaphone = @Metaphone + @This;
END;
/*'CK' transforms to 'K'.*/
ELSE IF @This = 'K'
BEGIN
IF (@Previous <> 'C')
SELECT @Metaphone = @Metaphone + 'K';
END;
/*'PH' transforms to 'F'.*/
ELSE IF @This = 'P'
BEGIN
IF(@Next = 'H')
SELECT @Metaphone = @Metaphone + 'F', @ii = @ii + 1;
-- Skip the 'H'
ELSE
SELECT @Metaphone = @Metaphone + 'P';
END;
/*'Q' transforms to 'K'.*/
ELSE IF @This = 'Q'
BEGIN
SELECT @Metaphone = @Metaphone + 'K';
END;
/*'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.*/
ELSE IF @This = 'S'
BEGIN
SELECT @Metaphone = @Metaphone
+ CASE WHEN (@Next = 'H')
OR
(
(@ii> 1)
AND (@Next = 'i')
AND (@Following IN ( 'O', 'A' ) )
)
THEN 'X'
ELSE 'S'
END;
END;
/*'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms
to '0'. Drop 'T' if followed by 'CH'.*/
ELSE IF @This = 'T'
BEGIN
SELECT @Metaphone = @Metaphone
+ CASE WHEN (@ii = 1)
AND (@Next = 'H')
AND (@Following = 'O')
THEN 'T' -- Initial Tho- }
WHEN (@ii > 1)
AND (@Next = 'i')
AND (@Following IN ( 'O', 'A' ))
THEN 'X'
WHEN (@Next = 'H')
THEN '0'
WHEN NOT((@Next = 'C')
AND (@Following = 'H'))
THEN 'T'
ELSE ''
END;
-- -tch = silent }
END;
/*'V' transforms to 'F'.*/
ELSE IF @This = 'V'
BEGIN
SELECT @Metaphone = @Metaphone + 'F';
END;
/*'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.*/
/*Drop 'Y' if not followed by a vowel.*/
ELSE IF @This IN ( 'W', 'Y' )
BEGIN
IF @Next IN ( 'A', 'E', 'I', 'O', 'U' )
SELECT @Metaphone = @Metaphone + @This;
--else silent
/*'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.*/
END;
ELSE IF @This = 'X'
BEGIN
SELECT @Metaphone = @Metaphone + 'KS';
END;
/*'Z' transforms to 'S'.*/
ELSE IF @This = 'Z'
BEGIN
SELECT @Metaphone = @Metaphone + 'S';
END;
ELSE
RETURN 'error with '''+ @This+ '''';
-- end
SELECT @ii = @ii + 1;
END; -- While
RETURN @Metaphone
END
The following Tests all generate the same result.
SELECT Utils.Metaphone('Aryton Sena')
,Utils.Metaphone('Aryton Senna')
,Utils.Metaphone('Ayrton Senna')
,Utils.Metaphone('Ayrton Sena')
,Utils.Metaphone('Aryten Sena')
,Utils.Metaphone('Aryten Senna')
,Utils.Metaphone('Ayrten Senna')
,Utils.Metaphone('Ayrten Sena');
result:
ARTNSN
Upvotes: 2
Reputation: 4622
As 'Senna' is not a reflection of 'Sena', it is difficult to solve this task using full text indexing.
I recommend using a combination of full text and string similarity to decide, whether two strings are considered 'equal'.
So if you search for more than one word and you allow one of them to be misspelled, use something like this
select *
from myTable t
join FullTextTable(myTable,TextField,'Ayrton Senna') f
on f.ID=t.PK
where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField)>0.9
Now all you need is a string similarity function. You can use the 'Similarity' function found in the microsoft data quality services or write your own.
Look for Jaro-Winkler, Levenshtein, Dice-Coefficient etc. These are good algorithms to do string similarity comparisons.
Of course you could also scan your whole database using
select *
from myTable t
where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField)>0.9
But this might take too long to perform.
Edit: However, we are currently using the first approach to find all similar spellings of a name. It works great.
Upvotes: 2
Reputation: 5403
Spell checkers usually work by having dictionaries that words are looked up in. If your word exactly matches a word in the dictionary then it is spelt correctly. If not then the closest match is found and this is suggested as a replacement. Some spell checkers hold alternative spellings or common mis-spellings but this doesn't fundamentally change the way they work.
Jaro-Winkler is a distance measure, in that it measures the "distance" between two words, i.e. how many transpositions have to be made to get from the first word to the second. Jaro is commonly used for matching people's names as this is what it excels at. It can also be used for more general matching but you need to be careful about abbreviations, etc. as these can confuse it.
Performance shouldn't be an issue. I usually implement the Jaro Winkler algorithm in a .NET application as it is tricky to write as a SQL UDF. You could also use an external CLR stored procedure I suppose? This performed fine when matching across tens of thousands of records. If you are going to potentially be matching millions of names then performance might be more of a concern?
Here is an example of how you might approach this: http://isolvable.blogspot.co.uk/2011/05/jaro-winkler-fast-fuzzy-linkage.html
Upvotes: 0
Reputation: 24526
Look at Full Text Search. This allows all sorts of searching including different word forms. You can configure word forms or use an out-of-the-box dictionary.
Quote (emphasis mine) :
Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.
See this answer regarding thesaurus.
An alternate to full text search is Lucene.
Upvotes: 0