larryltj
larryltj

Reputation: 15

Fuzzy string matching optimization (not checking certain words) - Excel VBA function

I have a function in Excel that calculates the Levenshtein Distance between two strings (the number of insertions, deletions, and/or substitutions needed to transform one string into another). I am using this as part of a project I'm working on that involves "fuzzy string matching."

Below you will see the code for the LevenshteinDistance function and a valuePhrase function. The latter exists for the purposes of executing the function in my spreadsheet. I have taken this from what I read in this thread.

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)`

Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
    Dim i As Long, j As Long, cost As Long 'loop counters and cost of 
        'substitution for current letter
    Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and 
        Substitution

    L1 = Len(S1): L2 = Len(S2)
    ReDim D(0 To L1, 0 To L2)
    For i = 0 To L1: D(i, 0) = i: Next i
    For j = 0 To L2: D(0, j) = j: Next j

    For j = 1 To L2
        For i = 1 To L1
            cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
            cI = D(i - 1, j) + 1
            cD = D(i, j - 1) + 1
            cS = D(i - 1, j - 1) + cost
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
            End If
        Next i
    Next j
    LevenshteinDistance = D(L1, L2)

End Function

Public Function valuePhrase#(ByRef S1$, ByRef S2$)

    valuePhrase = LevenshteinDistance(S1, S2)

End Function

I am executing this valuePhrase function in a table in one of my sheets where the column and row headers are names of insurance companies. Ideally, the smallest number in any given row (the shortest Levenshtein distance) should correspond to a column header with the name of the insurance company in the table that most closely matches the name of that insurance company in the row header.

My problem is that I am trying to calculate this in a case where the strings in question are names of insurance companies. With that in mind, the code above strictly calculates the Levenshtein distance and is not tailored specifically to this case. To illustrate, a simple example of why this can be an issue is because the Levenshtein distance between two insurance company names can be quite small if they both share the words "insurance" and "company" (which, as you might expect, is common), even if the insurance companies have totally different names with respect to their unique words. So, I may want the function to ignore those words when comparing two strings.

I am new to VBA. Is there a way I can implement this fix in the code? As a secondary question, are there other unique issues that could arise from comparing the names of insurance companies? Thank you for the help!

Upvotes: 1

Views: 2008

Answers (2)

Vityata
Vityata

Reputation: 43585

Your whole question can be replaced by "How do I use the replace function in VBA?". In general, the algorithm in the question looked interesting, thus I have done this for you. Simply add anything in the Array() of the function, it will work (Just write in lower case the values in the array):

Public Function removeSpecificWords(s As String) As String

 Dim arr     As Variant
 Dim cnt     As Long

 arr = Array("insurance", "company", "firma", "firm", "holding")
 removeSpecificWords = s

 For cnt = LBound(arr) To UBound(arr)
  removeSpecificWords = Replace(LCase(removeSpecificWords), LCase(arr(cnt)), vbNullString)
 Next cnt

End Function

Public Sub TestMe()

    Debug.Print removeSpecificWords("InsHolding")
    Debug.Print removeSpecificWords("InsuranceInsHoldingStar")

End Sub

In your case:

    S1 = removeSpecificWords(S1)
    S2 = removeSpecificWords(S2)
    valuePhrase = LevenshteinDistance(S1, S2)

Upvotes: 1

phrebh
phrebh

Reputation: 160

When I had a similar issue in trying to remove duplicate addresses, I approached the problem the other way and used the Longest Common Substring.

Function DetermineLCS(source As String, target As String) As Double
    Dim results() As Long
    Dim sourceLen As Long
    Dim targetLen As Long
    Dim counter1 As Long
    Dim counter2 As Long

    sourceLen = Len(source)
    targetLen = Len(target)

    ReDim results(0 To sourceLen, 0 To targetLen)

    For counter1 = 1 To sourceLen
        For counter2 = 1 To targetLen
            If Mid$(source, counter1, 1) = Mid$(target, counter2, 1) Then
                results(counter1, counter2) = results(counter1 - 1, counter2 - 1) + 1
            Else
                results(counter1, counter2) = WorksheetFunction.Max(results(counter1, _
                        counter2 - 1), results(counter1 - 1, counter2))
            End If
        Next counter2
    Next counter1

    'return the percentage of the LCS to the length of the source string
    DetermineLCS = results(sourceLen, targetLen) / sourceLen
End Function

For addresses, I've found that about an 80% match gets me close to a hundred percent matches. with insurance agency names (and I used to work in the industry, so I know the problem you face), I might suggest a 90% target or even a mix of the Levenshtein Distance and LCS, minimizing the former while maximizing the latter.

Upvotes: 0

Related Questions