Reputation: 2412
I have 2 data sets with a lot of strings, which I need to match. 1st is 1200 rows, 2nd about 800 000. I sort both sets via Excel sort called through VBA, so that they are in ascending order and therefore I can optimize searching speed significantly, by starting every next iteration of second data set one row after last match.
Unfortunately when no match is found, Exit For
will never be encountered, even when strings checked against my searched term are further in alphabet (> my string). I tried to implement comparison If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then
(perhaps with 'Mod' check before, if doing it in each iteration would be slow), but I encounter incorrect comparison values, for example ?"µm">"zzzzz"
returns true, while in data set it is as it should be, before strings starting with "a".
Is there any reliable approach to solve this problem?
Dim optimizedCounter as long, arrayIndex1 as long, arrayIndex2 as long
Dim vData1 as variant, vData2 as variant
'sort 2 data sets in Excel ascending
'assign data sets to arrays vData1 and vData2
optimizedCounter = LBound(vData2)
For arrayIndex1 = LBound(vData1) To UBound(vData1)
For arrayIndex2 = optimizedCounter To UBound(vData2)
If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
'do action when strings match
optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
End If
Next arrayIndex2
Next arrayIndex1
EDIT:
Thanks everyone for brilliant suggestions. For now A.S.H's solution with Application.Evaluate
/ StrComp
did the trick for me. Because I use default binary comparison vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1)
, where I want to preserve current speed, I can't use Option Compare Text.
For arrayIndex1 = LBound(vData1) To UBound(vData1)
For arrayIndex2 = optimizedCounter To UBound(vData2)
If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
'do action when strings match
optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
ElseIf arrayIndex2 Mod 1000 = 0 Then
If Application.Evaluate("""" & vData2(arrayIndex2, 1) & _
""" > """ & vData1(arrayIndex1, 1) & """") Then Exit For
'line below can be used instead of Application.Evaluate, the same speed, easier structure
'If StrComp(vData2(arrayIndex2, 1), vData1(arrayIndex1, 1), vbTextCompare) = 1 Then Exit For
End If
Next arrayIndex2
Next arrayIndex1
As this method takes some time, I have been forced to use it every nth number of iterations in order to receive performance gain. Depending on data set length and % of matching values, optimal mod values will be different.
As a comment to number of combinations checked, my list of search terms contains duplicates.
Vanilla code:
Execution time: 12.76
Combinations processed: 144596591
Application.Evaluate or StrComp:
Execution time: 17.30
Combinations processed: 1192341
Application.Evaluate or StrComp under condition mod 50 = 0:
Execution time: 0.48
Combinations processed: 1201717
Application.Evaluate or StrComp under condition mod 1000 = 0:
Execution time: 0.16
Combinations processed: 1376317
Increasing mod value will increase execution time from this point, due to bigger number of combinations processed.
I tried putting With Application
outside the main loop and using .Evaluate, it has completely no impact on speed.
EDIT 2:
Application.Match
and Application.Vlookup
won't work for arrays with > 65536 rows. They do work for ranges however, as comments pointed out.
Dim vMatch as Variant, myRng as Range
'myRng is set to one-column range of values to look for, about 800K rows
For arrayIndex1 = LBound(vData1) To UBound(vData1)
vMatch = Application.Match(vData1(arrayIndex1, 1), myRng, 0)
If Not IsError(vMatch) Then
'do action when strings match
End If
Next arrayIndex1
Application.Match with MatchType = 0:
Execution time: 28.81
Number of lookups: 1200
Upvotes: 2
Views: 1048
Reputation: 3324
I ran a little test with some binary match functions, and it runs 2 datasets 129K rows against 780K rows in about 3 seconds with 335K comparison checks. That's the power of binary search + a little tweaking.
Some modified 'Binary Search' utility functions:
Public Function wsArrayBinaryMatch( _
ByVal val As Variant, _
arr() As Variant, _
ByVal searchCol As Long, _
Optional optimalStart As Long, Optional optimalEnd As Long, Optional exactMatch As Boolean = True) As Variant
Dim a As Long, z As Long, curr As Long
wsArrayBinaryMatch = "Not Found in Range"
a = IIf(optimalStart, optimalStart, LBound(arr))
z = IIf(optimalEnd, optimalEnd, UBound(arr))
If compare(arr(a, searchCol), val) = 1 Then
Exit Function
End If
If compare(arr(a, searchCol), val) = 0 Then
wsArrayBinaryMatch = a
Exit Function
End If
If compare(arr(z, searchCol), val) = -1 Then
Exit Function
End If
While z - a > 1
curr = Round((CLng(a) + CLng(z)) / 2, 0)
If compare(arr(curr, searchCol), val) = 0 Then
z = curr
wsArrayBinaryMatch = curr
End If
If compare(arr(curr, searchCol), val) = -1 Then
a = curr
Else
z = curr
End If
Wend
If compare(arr(z, searchCol), val) = 0 Then
wsArrayBinaryMatch = z
Else
If Not exactMatch Then
wsArrayBinaryMatch = a
Else
'approx match to val was found inside the range...
wsArrayBinaryMatch = "ApproxIndex" & a
End If
End If
End Function
Public Function wsArrayBinaryLookup( _
ByVal val As Variant, _
arr() As Variant, _
ByVal searchCol As Long, _
ByVal returnCol As Long, _
Optional exactMatch As Boolean = True) As Variant
Dim a As Long, z As Long, curr As Long
wsArrayBinaryLookup = CVErr(xlErrNA)
a = LBound(arr)
z = UBound(arr)
If compare(arr(a, searchCol), val) = 1 Then
Exit Function
End If
If compare(arr(a, searchCol), val) = 0 Then
wsArrayBinaryLookup = arr(a, returnCol)
Exit Function
End If
If compare(arr(z, searchCol), val) = -1 Then
Exit Function
End If
While z - a > 1
curr = Round((CLng(a) + CLng(z)) / 2, 0)
If compare(arr(curr, searchCol), val) = 0 Then
z = curr
wsArrayBinaryLookup = arr(curr, returnCol)
End If
If compare(arr(curr, searchCol), val) = -1 Then
a = curr
Else
z = curr
End If
Wend
If compare(arr(z, searchCol), val) = 0 Then
wsArrayBinaryLookup = arr(z, returnCol)
Else
If Not exactMatch Then
wsArrayBinaryLookup = arr(a, returnCol)
End If
End If
End Function
Public Function compare(ByVal x As Variant, ByVal y As Variant) As Long
If IsNumeric(x) And IsNumeric(y) Then
Select Case x - y
Case Is = 0
compare = 0
Case Is > 0
compare = 1
Case Is < 0
compare = -1
End Select
Else
If TypeName(x) = "String" And TypeName(y) = "String" Then
compare = StrComp(x, y, vbTextCompare)
End If
End If
End Function
I then wrote a sub(can convert to function) that tried to make best use of the sorted data, and improve efficiency of restricting the search range. This involves alternating between trying to find low and high items in the first dataset.
Note that both datasets just had 2 columns each, and that it was searching for a match from 1st column of each. If match was found, then it set the value of the second column in the 1st set.
The way to return the approx match in a string is a bit hacky, but I was tired...
Sub BinaryMatchInSortedSets()
Dim set1() As Variant, set2() As Variant
set1 = Sheet1.Range("E2:F129601").Value '129K rows of strings and column F says 'Default'
set2 = Sheet1.Range("I2:J780001").Value '780K rows of strings and numbers
Dim low1 As Long, high1 As Long
Dim low2 As Long, high2 As Long
Dim indexToFind As Long, approxIndex As Long
low1 = LBound(set1)
high1 = UBound(set1)
low2 = LBound(set2)
high2 = UBound(set2)
Dim errString As String
Dim matchIndex As Variant
Dim searchingForLow As Boolean: searchingForLow = True
While low1 <= high1 And low2 < high2
indexToFind = IIf(searchingForLow, low1, high1)
matchIndex = wsArrayBinaryMatch(set1(indexToFind, 1), set2, 1, low2, high2, True)
If IsNumeric(matchIndex) Then
'match found
low2 = IIf(searchingForLow, matchIndex, low2)
high2 = IIf(searchingForLow, high2, matchIndex)
'do all other stuff in here that needs doing when match is found...
set1(indexToFind, 2) = set2(matchIndex, 2) 'Just an example of what you could do
Else
'no match, so set up efficient search range if possible
If Left(matchIndex, 11) = "ApproxIndex" Then
approxIndex = Mid(matchIndex, 12)
If searchingForLow Then
low2 = approxIndex + 1
Else
high2 = approxIndex - 1
End If
End If
End If
If searchingForLow Then
low1 = low1 + 1
Else
high1 = high1 - 1
End If
searchingForLow = Not searchingForLow
Wend
Sheet1.Range("L2").Resize(UBound(set1) - LBound(set1) + 1, 2).Value = set1
End Sub
Upvotes: 1
Reputation: 29332
If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then
... I encounter incorrect comparison values, for example?"µm">"zzzzz"
returns true, while in data set it is as it should be, before strings starting with "a".
Indeed, if the string comparison operation is different in the prior sorting and in your code, the prior sorting becomes useless. And this occurred because
Comparison in VBA is binary by default
?"µm">"zzzzz" ---> True
?Application.Evaluate("""µm"">""zzzzz""") ---> False
?StrComp("µm", "zzzzz") ---> 1
?StrComp("µm", "zzzzz", vbTextCompare) ---> -1
^^^^^^^^^^^^^^
p.s. unless you set Option Compare Text
or strComp
as pointed out in comments, or by using Excel's comparison:
If Application.Evaluate("""" & vData1(arrayIndex1, 1) & _
""" < """ & vData2(arrayIndex2, 1) & """") Then
This will solve the issue of the comparison mismatch. Indeed stopping your loop based on the <
comparison should make it much faster. Whether this is the best possible algorithm is another debate. Your arrays being sorted, binary search should be a perfect candidate.
Unless you go for binary search, consider using Excel's built-in functions, i.e. Application.VLookup
or Application.Match
, they are almost an order of magnitude faster than VBA loops, even if the latters are working on prefetched arrays.
Upvotes: 2