Mario
Mario

Reputation: 14780

SQL Server 2008 SELECT similar words

Is it possible to search for similar words in SQL Server 2008 ?

If the user types: Ayrton Sena

with a single 'n' it should also return rows with Ayrton Senna with two 'nn'

I think the same method applies for spell checking words

Upvotes: 2

Views: 3310

Answers (4)

Drammy
Drammy

Reputation: 988

I've been investigating a similar problem and have stumbled across a 'metaphone' algorithm that was created in 1990. It's effectively a more accurate version of Soundex and is useful for identifying phonetically similar words. It appears as an inbuilt function in some programming languages. Here's a SQL Server equivalent function by 'Phil Factor' that we've been using with some good success. It was reverse engineered from php's inbuilt metaphone function.

I've pasted a reformatted version below that I did just so I could read the code a bit easier.

IF  OBJECT_ID('Utils.Metaphone','FN') IS NOT NULL --drop any existing metaphone function
   DROP FUNCTION Utils.Metaphone
GO
CREATE FUNCTION Utils.Metaphone
(
    @String VARCHAR(30)
)
RETURNS VARCHAR(10)
AS
BEGIN
DECLARE @New BIT
        ,@ii INT
        ,@Metaphone VARCHAR(28)
        ,@Len INT
        ,@Where INT;
DECLARE @This CHAR
        ,@Next CHAR
        ,@Following CHAR
        ,@Previous CHAR
        ,@Silent BIT;

SELECT  @String = UPPER(LTRIM(COALESCE(@String, ''))); --trim and upper case
SELECT  @Where = PATINDEX ('%[^A-Z]%', @String COLLATE Latin1_General_CI_AI ) 
WHILE   @Where > 0 --strip out all non-alphabetic characters!
BEGIN
    SELECT @String = STUFF(@string, @Where, 1, '')
    SELECT @Where = PATINDEX ('%[^A-Z]%',@String COLLATE Latin1_General_CI_AI ) 
END
IF  (LEN(@String) < 2) RETURN  @String

--do the start of string stuff first.
--If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
-- "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"
IF  SUBSTRING(@String, 1, 2) IN ( 'KN', 'GN', 'PN', 'AE', 'WR' )
    SELECT @String = STUFF(@String, 1, 1, '');
-- Beginning of word: "x" change to "s" as in "Deng Xiaopeng"
IF  SUBSTRING(@String, 1, 1) = 'X'
    SELECT @String = STUFF(@String, 1, 1, 'S');
-- Beginning of word: "wh-" change to "w" as in "Whatsoever"
IF  @String LIKE 'WH%'
    SELECT @String = STUFF(@String, 1, 1, 'W');
-- Set up for While loop 
SELECT  @Len = LEN(@String), @Metaphone = '' -- Initialize the main variable 
        ,@New = 1 -- this variable only used next 10 lines!!!
        ,@ii = 1; --Position counter
--
WHILE((LEN(@Metaphone) <= 8) AND (@ii <= @Len))
BEGIN --SET up the 'pointers' for this loop-around }
    SELECT  @Previous = CASE WHEN @ii > 1 THEN SUBSTRING(@String, @ii - 1, 1) ELSE '' END
            -- originally a nul terminated string }
            ,@This = SUBSTRING(@String, @ii, 1)
            ,@Next = CASE WHEN @ii < @Len THEN SUBSTRING(@String, @ii + 1, 1) ELSE '' END
            ,@Following = CASE WHEN((@ii + 1) < @Len) THEN SUBSTRING(@String, @ii + 2, 1) ELSE '' END

    -- 'CC' inside word
    /* Drop duplicate adjacent letters, except for C.*/
    IF  @This=@Previous AND @This<> 'C' 
    BEGIN
        --we do nothing 
        SELECT  @New=0
    END
    /*Drop all vowels unless it is the beginning.*/
    ELSE IF @This IN ( 'A', 'E', 'I', 'O', 'U' )
    BEGIN
        IF  @ii = 1 --vowel at the beginning
            SELECT @Metaphone = @This;
            /* B -> B unless at the end of word after "m", as in "dumb", "Comb" */
    END;
    ELSE IF @This = 'B' AND NOT ((@ii = @Len) AND (@Previous = 'M'))
    BEGIN
        SELECT @Metaphone = @Metaphone + 'B';
    END;
    -- -mb is silent
    /*'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-',
    in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. 
    Otherwise, 'C' transforms to 'K'.*/
    ELSE IF @This = 'C'
    BEGIN -- -sce, i, y = silent 
        IF NOT (@Previous= 'S') AND (@Next IN ( 'H', 'E', 'I', 'Y' )) --front vowel set 
        BEGIN
            IF  (@Next = 'I') AND (@Following = 'A')
                SELECT @Metaphone = @Metaphone + 'X'; -- -cia- 
            ELSE IF(@Next IN ( 'E', 'I', 'Y' ))
                SELECT @Metaphone = @Metaphone + 'S'; -- -ce, i, y = 'S' }
            ELSE IF(@Next = 'H') AND (@Previous = 'S')
                SELECT @Metaphone = @Metaphone + 'K'; -- -sch- = 'K' }
            ELSE IF(@Next = 'H')
            BEGIN
                IF(@ii = 1) AND ((@ii + 2) <= @Len) AND NOT(@Following IN ( 'A', 'E', 'I', 'O', 'U' ))
                    SELECT @Metaphone = @Metaphone + 'K';
                ELSE
                    SELECT @Metaphone = @Metaphone + 'X';
            END
        END
        ELSE 
            SELECT @Metaphone = @Metaphone +CASE WHEN @Previous= 'S' THEN '' ELSE 'K' END;
        -- Else silent 
    END; -- Case C }
    /*'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' 
    transforms to 'T'.*/
    ELSE IF @This = 'D'
    BEGIN
        SELECT @Metaphone = @Metaphone
                + CASE WHEN(@Next = 'G') AND (@Following IN ( 'E', 'I', 'Y' )) --front vowel set 
                        THEN 'J'
                        ELSE 'T'
                  END;
    END;
    ELSE IF @This = 'G'
    /*Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' 
    if followed by 'N' or 'NED' and is at the end.
    'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. 
    Otherwise, 'G' transforms to 'K'.*/
    BEGIN
        SELECT @Silent = CASE WHEN (@Next = 'H')
                                    AND (@Following IN ('A','E','I','O','U'))
                                    AND (@ii > 1)
                                    AND (
                                                ((@ii+1) = @Len)
                                            OR 
                                                (
                                                    (@Next = 'n')
                                                    AND (@Following = 'E')
                                                    AND SUBSTRING(@String,@ii+3,1) = 'D'
                                                )
                                            AND ((@ii+3) = @Len)
                                        )
                                    -- Terminal -gned 
                                    AND (@Previous = 'i')
                                    AND (@Next = 'n')
                                THEN 1 
                                -- if not start and near -end or -gned.) 
                              WHEN (@ii > 1)
                                    AND (@Previous = 'D')-- gnuw
                                    AND (@Next IN ('E','I','Y')) --front vowel set 
                                THEN 1 -- -dge, i, y 
                              ELSE 0
                          END
        IF  NOT(@Silent=1)
            SELECT @Metaphone = @Metaphone
                    + CASE WHEN (@Next IN ('E','I','Y')) --front vowel set 
                            THEN 'J'
                            ELSE 'K'
                      END
    END
    /*Drop 'H' if after vowel and not before a vowel.
    or the second char of  "-ch-", "-sh-", "-ph-", "-th-", "-gh-"*/

    ELSE IF @This = 'H'
    BEGIN
        IF  NOT( (@ii= @Len) OR (@Previous IN  ( 'C', 'S', 'T', 'G' ))) 
            AND (@Next IN ( 'A', 'E', 'I', 'O', 'U' ) )
            SELECT @Metaphone = @Metaphone + 'H';
            -- else silent (vowel follows) }
    END;
    ELSE IF @This IN --some get no substitution
                ( 'F', 'J', 'L', 'M', 'N', 'R' )
    BEGIN
        SELECT  @Metaphone = @Metaphone + @This;
    END;
    /*'CK' transforms to 'K'.*/
    ELSE IF @This = 'K'
    BEGIN
        IF  (@Previous <> 'C')
            SELECT @Metaphone = @Metaphone + 'K';
    END;
    /*'PH' transforms to 'F'.*/
    ELSE IF @This = 'P'
    BEGIN
        IF(@Next = 'H')
            SELECT @Metaphone = @Metaphone + 'F', @ii = @ii + 1;
        -- Skip the 'H' 
        ELSE
            SELECT @Metaphone = @Metaphone + 'P';
    END;
    /*'Q' transforms to 'K'.*/
    ELSE IF @This = 'Q'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'K';
    END;
    /*'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.*/
    ELSE IF @This = 'S'
    BEGIN
        SELECT @Metaphone = @Metaphone
                + CASE WHEN (@Next = 'H')
                            OR
                                (
                                    (@ii> 1)
                                    AND (@Next = 'i') 
                                    AND (@Following IN ( 'O', 'A' ) )
                                )
                        THEN 'X'
                        ELSE 'S'
                  END;
    END;
    /*'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms 
    to '0'. Drop 'T' if followed by 'CH'.*/
    ELSE IF @This = 'T'
    BEGIN
        SELECT @Metaphone = @Metaphone
                + CASE WHEN (@ii = 1)
                            AND (@Next = 'H')
                            AND (@Following = 'O') 
                        THEN 'T' -- Initial Tho- }
                       WHEN (@ii > 1)
                            AND (@Next = 'i') 
                            AND (@Following IN ( 'O', 'A' )) 
                        THEN 'X'
                       WHEN (@Next = 'H')
                        THEN '0'
                       WHEN NOT((@Next = 'C')
                                AND (@Following = 'H')) 
                        THEN 'T'
                       ELSE ''
                  END;
                -- -tch = silent }
    END;
    /*'V' transforms to 'F'.*/
    ELSE IF @This = 'V'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'F';
    END;
    /*'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.*/
    /*Drop 'Y' if not followed by a vowel.*/
    ELSE IF @This IN ( 'W', 'Y' )
    BEGIN
    IF @Next IN ( 'A', 'E', 'I', 'O', 'U' )
        SELECT @Metaphone = @Metaphone + @This;
    --else silent 
    /*'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.*/
    END;
    ELSE IF @This = 'X'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'KS';
    END;
    /*'Z' transforms to 'S'.*/
    ELSE IF @This = 'Z'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'S';
    END;
    ELSE
        RETURN 'error with '''+ @This+ '''';
    -- end
    SELECT @ii = @ii + 1;
END; -- While 
RETURN @Metaphone 
END

The following Tests all generate the same result.

SELECT  Utils.Metaphone('Aryton Sena')
        ,Utils.Metaphone('Aryton Senna')
        ,Utils.Metaphone('Ayrton Senna')
        ,Utils.Metaphone('Ayrton Sena')
        ,Utils.Metaphone('Aryten Sena')
        ,Utils.Metaphone('Aryten Senna')
        ,Utils.Metaphone('Ayrten Senna')
        ,Utils.Metaphone('Ayrten Sena');

result:

ARTNSN

Upvotes: 2

alzaimar
alzaimar

Reputation: 4622

As 'Senna' is not a reflection of 'Sena', it is difficult to solve this task using full text indexing.

I recommend using a combination of full text and string similarity to decide, whether two strings are considered 'equal'.

So if you search for more than one word and you allow one of them to be misspelled, use something like this

select * 
  from myTable t 
       join FullTextTable(myTable,TextField,'Ayrton Senna') f 
         on f.ID=t.PK
where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField)>0.9

Now all you need is a string similarity function. You can use the 'Similarity' function found in the microsoft data quality services or write your own.

Look for Jaro-Winkler, Levenshtein, Dice-Coefficient etc. These are good algorithms to do string similarity comparisons.

Of course you could also scan your whole database using

select *
 from myTable t
 where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField)>0.9

But this might take too long to perform.

Edit: However, we are currently using the first approach to find all similar spellings of a name. It works great.

Upvotes: 2

Richard Hansell
Richard Hansell

Reputation: 5403

Spell checkers usually work by having dictionaries that words are looked up in. If your word exactly matches a word in the dictionary then it is spelt correctly. If not then the closest match is found and this is suggested as a replacement. Some spell checkers hold alternative spellings or common mis-spellings but this doesn't fundamentally change the way they work.

Jaro-Winkler is a distance measure, in that it measures the "distance" between two words, i.e. how many transpositions have to be made to get from the first word to the second. Jaro is commonly used for matching people's names as this is what it excels at. It can also be used for more general matching but you need to be careful about abbreviations, etc. as these can confuse it.

Performance shouldn't be an issue. I usually implement the Jaro Winkler algorithm in a .NET application as it is tricky to write as a SQL UDF. You could also use an external CLR stored procedure I suppose? This performed fine when matching across tens of thousands of records. If you are going to potentially be matching millions of names then performance might be more of a concern?

Here is an example of how you might approach this: http://isolvable.blogspot.co.uk/2011/05/jaro-winkler-fast-fuzzy-linkage.html

Upvotes: 0

Paul Fleming
Paul Fleming

Reputation: 24526

Look at Full Text Search. This allows all sorts of searching including different word forms. You can configure word forms or use an out-of-the-box dictionary.

Quote (emphasis mine) :

Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

See this answer regarding thesaurus.

An alternate to full text search is Lucene.

Upvotes: 0

Related Questions