surfmuggle
surfmuggle

Reputation: 5944

Compare Names using Levenshtein distance

In my application i need to identify a person by searching for their lastname and firstname. One requirement is to accept spelling errors to a certain degree.

My attempts to identify a person given a firstname and lastname were:

  1. sql query using soundex
  2. sql query using levenshtein-distance (LD) which was calculated with this LD-function

The screenshot contains some test records and the result of my sql-query which includes the soundex-value for each column and the LD

compare records using levenshtein distance

My current query looks like this

SELECT t2.*
        , t1.Firstname + ' ' + t1.Lastname as SourceName
        , 'Torsten Mueller' as TargetName
        , dbo.FUNC_LEVENSHTEIN(t1.Firstname +' '+ t1.Lastname
                            , 'Torsten Mueller', 8) as LEVENSHTEIN_Distance     
 FROM #TestSoundex t1
 LEFT JOIN #TestSoundex t2 ON t1.Id = t2.Id
 WHERE t1.Soundex_Firstname = SOUNDEX('Torsten')
       AND t1.Soundex_Lastname = SOUNDEX('Mueller')

As you can see i filter the result by soundex first and calculate the levenshtein-distance for the remaining records. In this sample below the levenshtein-distance ranges from 0 (both strings are equal) to 3.

SourceName TargetName Levenshtein Distance
Thorsten Müller Torsten Mueller 3
Torsten Müller Torsten Mueller 2
Thorsten Mueller Torsten Mueller 1
Torsten Mueller Torsten Mueller 0

In this talk from a stanford professor (video no longer public) the calculation for the distance is explained:

I N T E * N TION 
| | | | | | | 
* E X E C U TION
d s s   i s

Each deletion d, insertion i adds 1 point, a substitution s adds 2 points. The LD-function i use returns 5 points for the sample above, but only 3 instead of 4 for the distance between Thorsten Müller and Torsten Mueller. I

+1 point to delete h, 
+1 point instead of 2 to substitute ü 
+1 point to insert e

So i added some samples

Samples with Umlaut

Questions

My impression is that neither soundex nor LD are sufficient to uniquely identify a person-record given firstname and lastname and taking into account that there might be spelling mismatches.

Source Code

This is the function i use from the linked answer. I only changed the name of the function from edit_distance_within to FUNC_LEVENSHTEIN

    SET QUOTED_IDENTIFIER ON 
    GO
    SET ANSI_NULLS ON 
    GO
    
    CREATE FUNCTION FUNC_LEVENSHTEIN(@s nvarchar(4000), @t nvarchar(4000), @d int)
    RETURNS int
    AS
    BEGIN
      DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
        @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
      SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
      WHILE @j <= @tl
        SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
      WHILE @i <= @sl
      BEGIN
        SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
        WHILE @j <= @tl
        BEGIN
          SET @c = @c + 1
          SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
          IF @c > @c1 SET @c = @c1
          SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
          IF @c > @c1 SET @c = @c1
          IF @c < @cmin SET @cmin = @c
          SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
        END
        IF @cmin > @d BREAK
        SELECT @cv1 = @cv0, @i = @i + 1
      END
      RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
    END
    GO

Source to test function above

This is another test

    CREATE TABLE #TestLevenshteinDistance(
        Id int  IDENTITY(1,1) NOT NULL,
        SourceName nvarchar(100) NULL,  
        Soundex_SourceName varchar(4) NULL,    
        Targetname nvarchar(100) NULL, 
        Soundex_TargetName varchar(4) NULL, 
        );      
       
    INSERT INTO #TestLevenshteinDistance 
        (    SourceName,          
             Soundex_SourceName,
             Targetname,
             Soundex_TargetName) 
    VALUES 
       ('Intention',SOUNDEX('Intention'), 'Execution', SOUNDEX('Execution')),    
       ('Karsten' , SOUNDEX('Karsten'), 'Torsten', SOUNDEX('Torsten')); 
             
    
    SELECT t1.*
            , dbo.FUNC_LEVENSHTEIN(t1.SourceName, t1.Targetname, 8) as LEVENSHTEIN_Distance
            FROM #TestLevenshteinDistance t1

Update 2024

Based on the comment i tested DIFFERENCE in this sql online demo

Demo compare string with difference and Levenshtein Distance

The Difference function seems to be less strict than LD. The Difference value for two very

I added 'Foo' and 'Bar' and did not expect that the difference is just 2. I assumed a difference of 1 (less similar).

Upvotes: 7

Views: 6323

Answers (0)

Related Questions