Luis
Luis

Reputation: 209

Debugging Levenshtein distance implementation - how is the min distance being calculated?

I have been trying to understand this code for while but I can't get how the distances are calculated. I know how the algorithm should work (it provides the edit distance in characters between two strings and I can do it on paper), but I don't understand how this line (as an example)

d(i, j) = d(i - 1, j - 1)

comes back as an integer. Same for min1, min2. We already define

d(i,0)

and

d(0,j)

, but how does it get a value when

d(i,j)

is not 0 in one of the two arguments?

Code:

Option Explicit
Public Function Levenshtein(s1 As String, s2 As String)

Dim i As Integer
Dim j As Integer
Dim l1 As Integer
Dim l2 As Integer
Dim d() As Integer
Dim min1 As Integer
Dim min2 As Integer

l1 = Len(s1)
l2 = Len(s2)
ReDim d(l1, l2)
For i = 0 To l1
    d(i, 0) = i
Next
For j = 0 To l2
    d(0, j) = j
Next
For i = 1 To l1
    For j = 1 To l2
        If Mid(s1, i, 1) = Mid(s2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)   
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1
        End If
    Next
Next
Levenshtein = d(l1, l2)
End Function

?Levenshtein("saturday","sunday")

3

Upvotes: 1

Views: 956

Answers (1)

FAB
FAB

Reputation: 2569

I don't have much clue about Levenshtein, however I can explain what's going on in that array, please see comments in the code:

l1 = Len(s1) 'this sets a variable with the length of the first string
l2 = Len(s2) 'this sets a variable with the length of the second string

ReDim d(l1, l2) 'redimension the array to this size, though this can be seen as ReDim d(0 to l1, 0 to l2)

For i = 0 To l1 'for i = 0 to length of first string
    d(i, 0) = i 'allocate the value of i to each row in the array, at col 0
Next i

For j = 0 To l2 'for j = 0 to length of second string
    d(0, j) = j 'allocate the value of j to each column in the array, at row 0
Next j

EDIT: I've added some debugging (will print to ActiveSheet). Blue for where the letters match, red for the others... though the colour doesn't matter.

As far as I can tell, with each loop iteration, it builds a matrix of the differences, starting from comparing 1:1 first letters, till the last. Based on the loop position, and with each iteration, it will use previous position values to calculate the difference at current position.

This is probably easier to follow with shorter strings.

  • Loop through each letter in second word, then loop through each letter in first word

  • at first outer loop (for each row), first inner loop (for each column), we have S meet S = 0. (d(i, j) = d(i - 1, j - 1) evaluates to d(i, j) = d(1 - 1, 1 - 1), i.e.: 0)

  • at first outer loop, second inner loop, we have S meet U = 1.

  • etc, etc.

enter image description here

Other than that, step through the code and see how the variables change based on the if conditions... not sure i can explain this any better.

Public Function Levenshtein(s1 As String, s2 As String)

Dim i As Integer, j As Integer
Dim l1 As Integer, l2 As Integer
Dim min1 As Integer, min2 As Integer
Dim d() As Integer

'For debugging purposes only
Cells.Clear
Dim rngOutput As Range: Set rngOutput = ActiveSheet.Range("A1").Resize(Len(s1) + 2, Len(s2) + 2)

With rngOutput
    .ColumnWidth = 3
    .HorizontalAlignment = xlCenter
    .VerticalAlignment = xlCenter
End With

l1 = Len(s1): l2 = Len(s2): ReDim d(l1, l2)

For i = 0 To l1
    d(i, 0) = i
    With rngOutput
        .Cells(i + 3, 1) = Mid(s1, i + 1, 1)
        If Not i = 0 Then .Cells(i + 2, 2) = i
    End With
Next i

For j = 0 To l2
    d(0, j) = j
    With rngOutput
        .Cells(1, j + 3) = Mid(s2, j + 1, 1)
        If Not j = 0 Then .Cells(2, j + 2) = j
    End With
Next j

For i = 1 To l1
    For j = 1 To l2
        If Mid(s1, i, 1) = Mid(s2, j, 1) Then
            d(i, j) = d(i - 1, j - 1)

            With rngOutput.Cells(i + 2, j + 2)
                .Value = d(i, j)
                .Font.Color = vbBlue
            End With
        Else
            min1 = d(i - 1, j) + 1
            min2 = d(i, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            min2 = d(i - 1, j - 1) + 1
            If min2 < min1 Then
                min1 = min2
            End If
            d(i, j) = min1

            With Cells(i + 2, j + 2)
                .Value = d(i, j)
                .Font.Color = vbRed
            End With
        End If
    Next
Next

Levenshtein = d(l1, l2)
End Function

Upvotes: 1

Related Questions