Reputation: 15
I have a function in Excel that calculates the Levenshtein Distance between two strings (the number of insertions, deletions, and/or substitutions needed to transform one string into another). I am using this as part of a project I'm working on that involves "fuzzy string matching."
Below you will see the code for the LevenshteinDistance function and a valuePhrase function. The latter exists for the purposes of executing the function in my spreadsheet. I have taken this from what I read in this thread.
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)`
Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, cost As Long 'loop counters and cost of
'substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and
Substitution
L1 = Len(S1): L2 = Len(S2)
ReDim D(0 To L1, 0 To L2)
For i = 0 To L1: D(i, 0) = i: Next i
For j = 0 To L2: D(0, j) = j: Next j
For j = 1 To L2
For i = 1 To L1
cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
cI = D(i - 1, j) + 1
cD = D(i, j - 1) + 1
cS = D(i - 1, j - 1) + cost
If cI <= cD Then 'Insertion or Substitution
If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
Else 'Deletion or Substitution
If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
End If
Next i
Next j
LevenshteinDistance = D(L1, L2)
End Function
Public Function valuePhrase#(ByRef S1$, ByRef S2$)
valuePhrase = LevenshteinDistance(S1, S2)
End Function
I am executing this valuePhrase function in a table in one of my sheets where the column and row headers are names of insurance companies. Ideally, the smallest number in any given row (the shortest Levenshtein distance) should correspond to a column header with the name of the insurance company in the table that most closely matches the name of that insurance company in the row header.
My problem is that I am trying to calculate this in a case where the strings in question are names of insurance companies. With that in mind, the code above strictly calculates the Levenshtein distance and is not tailored specifically to this case. To illustrate, a simple example of why this can be an issue is because the Levenshtein distance between two insurance company names can be quite small if they both share the words "insurance" and "company" (which, as you might expect, is common), even if the insurance companies have totally different names with respect to their unique words. So, I may want the function to ignore those words when comparing two strings.
I am new to VBA. Is there a way I can implement this fix in the code? As a secondary question, are there other unique issues that could arise from comparing the names of insurance companies? Thank you for the help!
Upvotes: 1
Views: 2008
Reputation: 43585
Your whole question can be replaced by "How do I use the replace function in VBA?". In general, the algorithm in the question looked interesting, thus I have done this for you. Simply add anything in the Array()
of the function, it will work (Just write in lower case the values in the array):
Public Function removeSpecificWords(s As String) As String
Dim arr As Variant
Dim cnt As Long
arr = Array("insurance", "company", "firma", "firm", "holding")
removeSpecificWords = s
For cnt = LBound(arr) To UBound(arr)
removeSpecificWords = Replace(LCase(removeSpecificWords), LCase(arr(cnt)), vbNullString)
Next cnt
End Function
Public Sub TestMe()
Debug.Print removeSpecificWords("InsHolding")
Debug.Print removeSpecificWords("InsuranceInsHoldingStar")
End Sub
In your case:
S1 = removeSpecificWords(S1)
S2 = removeSpecificWords(S2)
valuePhrase = LevenshteinDistance(S1, S2)
Upvotes: 1
Reputation: 160
When I had a similar issue in trying to remove duplicate addresses, I approached the problem the other way and used the Longest Common Substring.
Function DetermineLCS(source As String, target As String) As Double
Dim results() As Long
Dim sourceLen As Long
Dim targetLen As Long
Dim counter1 As Long
Dim counter2 As Long
sourceLen = Len(source)
targetLen = Len(target)
ReDim results(0 To sourceLen, 0 To targetLen)
For counter1 = 1 To sourceLen
For counter2 = 1 To targetLen
If Mid$(source, counter1, 1) = Mid$(target, counter2, 1) Then
results(counter1, counter2) = results(counter1 - 1, counter2 - 1) + 1
Else
results(counter1, counter2) = WorksheetFunction.Max(results(counter1, _
counter2 - 1), results(counter1 - 1, counter2))
End If
Next counter2
Next counter1
'return the percentage of the LCS to the length of the source string
DetermineLCS = results(sourceLen, targetLen) / sourceLen
End Function
For addresses, I've found that about an 80% match gets me close to a hundred percent matches. with insurance agency names (and I used to work in the industry, so I know the problem you face), I might suggest a 90% target or even a mix of the Levenshtein Distance and LCS, minimizing the former while maximizing the latter.
Upvotes: 0