Ryszard Jędraszyk
Ryszard Jędraszyk

Reputation: 2412

VBA - matching 2 arrays of sorted strings, where some elements have no match - optimization

I have 2 data sets with a lot of strings, which I need to match. 1st is 1200 rows, 2nd about 800 000. I sort both sets via Excel sort called through VBA, so that they are in ascending order and therefore I can optimize searching speed significantly, by starting every next iteration of second data set one row after last match.

Unfortunately when no match is found, Exit For will never be encountered, even when strings checked against my searched term are further in alphabet (> my string). I tried to implement comparison If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then (perhaps with 'Mod' check before, if doing it in each iteration would be slow), but I encounter incorrect comparison values, for example ?"µm">"zzzzz" returns true, while in data set it is as it should be, before strings starting with "a".

Is there any reliable approach to solve this problem?

Dim optimizedCounter as long, arrayIndex1 as long, arrayIndex2 as long
Dim vData1 as variant, vData2 as variant

'sort 2 data sets in Excel ascending
'assign data sets to arrays vData1 and vData2

optimizedCounter = LBound(vData2)

For arrayIndex1 = LBound(vData1) To UBound(vData1)
    For arrayIndex2 = optimizedCounter To UBound(vData2)
        If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
            'do action when strings match
            optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
            Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
        End If
    Next arrayIndex2
Next arrayIndex1

EDIT:

Thanks everyone for brilliant suggestions. For now A.S.H's solution with Application.Evaluate / StrComp did the trick for me. Because I use default binary comparison vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1), where I want to preserve current speed, I can't use Option Compare Text.

For arrayIndex1 = LBound(vData1) To UBound(vData1)
    For arrayIndex2 = optimizedCounter To UBound(vData2)
        If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
            'do action when strings match
            optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
            Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
        ElseIf arrayIndex2 Mod 1000 = 0 Then
            If Application.Evaluate("""" & vData2(arrayIndex2, 1) & _
            """ > """ & vData1(arrayIndex1, 1) & """") Then Exit For
            'line below can be used instead of Application.Evaluate, the same speed, easier structure
            'If StrComp(vData2(arrayIndex2, 1), vData1(arrayIndex1, 1), vbTextCompare) = 1 Then Exit For
        End If           
    Next arrayIndex2
Next arrayIndex1 

As this method takes some time, I have been forced to use it every nth number of iterations in order to receive performance gain. Depending on data set length and % of matching values, optimal mod values will be different.

As a comment to number of combinations checked, my list of search terms contains duplicates.

Vanilla code:

Execution time: 12.76

Combinations processed: 144596591

Application.Evaluate or StrComp:

Execution time: 17.30

Combinations processed: 1192341

Application.Evaluate or StrComp under condition mod 50 = 0:

Execution time: 0.48

Combinations processed: 1201717

Application.Evaluate or StrComp under condition mod 1000 = 0:

Execution time: 0.16

Combinations processed: 1376317

Increasing mod value will increase execution time from this point, due to bigger number of combinations processed.

I tried putting With Application outside the main loop and using .Evaluate, it has completely no impact on speed.

EDIT 2:

Application.Match and Application.Vlookup won't work for arrays with > 65536 rows. They do work for ranges however, as comments pointed out.

Dim vMatch as Variant, myRng as Range

'myRng is set to one-column range of values to look for, about 800K rows

For arrayIndex1 = LBound(vData1) To UBound(vData1)
        vMatch = Application.Match(vData1(arrayIndex1, 1), myRng, 0)
        If Not IsError(vMatch) Then
            'do action when strings match
        End If
Next arrayIndex1

Application.Match with MatchType = 0:

Execution time: 28.81

Number of lookups: 1200

Upvotes: 2

Views: 1048

Answers (2)

MacroMarc
MacroMarc

Reputation: 3324

I ran a little test with some binary match functions, and it runs 2 datasets 129K rows against 780K rows in about 3 seconds with 335K comparison checks. That's the power of binary search + a little tweaking.

Some modified 'Binary Search' utility functions:

Public Function wsArrayBinaryMatch( _
        ByVal val As Variant, _
        arr() As Variant, _
        ByVal searchCol As Long, _
        Optional optimalStart As Long, Optional optimalEnd As Long, Optional exactMatch As Boolean = True) As Variant

  Dim a As Long, z As Long, curr As Long

  wsArrayBinaryMatch = "Not Found in Range"
  a = IIf(optimalStart, optimalStart, LBound(arr))
  z = IIf(optimalEnd, optimalEnd, UBound(arr))

  If compare(arr(a, searchCol), val) = 1 Then
        Exit Function
  End If

  If compare(arr(a, searchCol), val) = 0 Then
        wsArrayBinaryMatch = a
        Exit Function
  End If

  If compare(arr(z, searchCol), val) = -1 Then
        Exit Function
  End If

  While z - a > 1
        curr = Round((CLng(a) + CLng(z)) / 2, 0)
        If compare(arr(curr, searchCol), val) = 0 Then
              z = curr
              wsArrayBinaryMatch = curr
        End If

        If compare(arr(curr, searchCol), val) = -1 Then
              a = curr
        Else
              z = curr
        End If
  Wend

  If compare(arr(z, searchCol), val) = 0 Then
        wsArrayBinaryMatch = z
  Else
        If Not exactMatch Then
              wsArrayBinaryMatch = a
        Else
              'approx match to val was found inside the range...
              wsArrayBinaryMatch = "ApproxIndex" & a
        End If
  End If

End Function

Public Function wsArrayBinaryLookup( _
        ByVal val As Variant, _
        arr() As Variant, _
        ByVal searchCol As Long, _
        ByVal returnCol As Long, _
        Optional exactMatch As Boolean = True) As Variant

  Dim a As Long, z As Long, curr As Long

  wsArrayBinaryLookup = CVErr(xlErrNA)
  a = LBound(arr)
  z = UBound(arr)

  If compare(arr(a, searchCol), val) = 1 Then
        Exit Function
  End If

  If compare(arr(a, searchCol), val) = 0 Then
        wsArrayBinaryLookup = arr(a, returnCol)
        Exit Function
  End If

  If compare(arr(z, searchCol), val) = -1 Then
        Exit Function
  End If

  While z - a > 1
        curr = Round((CLng(a) + CLng(z)) / 2, 0)
        If compare(arr(curr, searchCol), val) = 0 Then
              z = curr
              wsArrayBinaryLookup = arr(curr, returnCol)
        End If

        If compare(arr(curr, searchCol), val) = -1 Then
              a = curr
        Else
              z = curr
        End If
  Wend

  If compare(arr(z, searchCol), val) = 0 Then
        wsArrayBinaryLookup = arr(z, returnCol)
  Else
        If Not exactMatch Then
              wsArrayBinaryLookup = arr(a, returnCol)
        End If
  End If

End Function

Public Function compare(ByVal x As Variant, ByVal y As Variant) As Long

  If IsNumeric(x) And IsNumeric(y) Then
        Select Case x - y
              Case Is = 0
                    compare = 0
              Case Is > 0
                    compare = 1
              Case Is < 0
                    compare = -1
        End Select
  Else
        If TypeName(x) = "String" And TypeName(y) = "String" Then
              compare = StrComp(x, y, vbTextCompare)
        End If
  End If

End Function

I then wrote a sub(can convert to function) that tried to make best use of the sorted data, and improve efficiency of restricting the search range. This involves alternating between trying to find low and high items in the first dataset.

Note that both datasets just had 2 columns each, and that it was searching for a match from 1st column of each. If match was found, then it set the value of the second column in the 1st set.

The way to return the approx match in a string is a bit hacky, but I was tired...

Sub BinaryMatchInSortedSets()

  Dim set1() As Variant, set2() As Variant
  set1 = Sheet1.Range("E2:F129601").Value  '129K rows of strings and column F says 'Default'
  set2 = Sheet1.Range("I2:J780001").Value  '780K rows of strings and numbers

  Dim low1 As Long, high1 As Long
  Dim low2 As Long, high2 As Long
  Dim indexToFind As Long, approxIndex As Long
  low1 = LBound(set1)
  high1 = UBound(set1)
  low2 = LBound(set2)
  high2 = UBound(set2)

  Dim errString As String
  Dim matchIndex As Variant
  Dim searchingForLow As Boolean: searchingForLow = True
  While low1 <= high1 And low2 < high2
        indexToFind = IIf(searchingForLow, low1, high1)

        matchIndex = wsArrayBinaryMatch(set1(indexToFind, 1), set2, 1, low2, high2, True)

        If IsNumeric(matchIndex) Then
              'match found
              low2 = IIf(searchingForLow, matchIndex, low2)
              high2 = IIf(searchingForLow, high2, matchIndex)
              'do all other stuff in here that needs doing when match is found...
              set1(indexToFind, 2) = set2(matchIndex, 2)  'Just an example of what you could do
        Else
              'no match, so set up efficient search range if possible
              If Left(matchIndex, 11) = "ApproxIndex" Then
                    approxIndex = Mid(matchIndex, 12)
                    If searchingForLow Then
                          low2 = approxIndex + 1
                    Else
                          high2 = approxIndex - 1
                    End If
              End If
        End If

        If searchingForLow Then
              low1 = low1 + 1
        Else
              high1 = high1 - 1
        End If
        searchingForLow = Not searchingForLow
  Wend
  Sheet1.Range("L2").Resize(UBound(set1) - LBound(set1) + 1, 2).Value = set1

End Sub

Upvotes: 1

A.S.H
A.S.H

Reputation: 29332

If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then ... I encounter incorrect comparison values, for example ?"µm">"zzzzz" returns true, while in data set it is as it should be, before strings starting with "a".

Indeed, if the string comparison operation is different in the prior sorting and in your code, the prior sorting becomes useless. And this occurred because

Comparison in VBA is binary by default

?"µm">"zzzzz"                              --->   True
?Application.Evaluate("""µm"">""zzzzz""")  --->   False
?StrComp("µm", "zzzzz")                    --->    1 
?StrComp("µm", "zzzzz", vbTextCompare)     --->   -1 
                        ^^^^^^^^^^^^^^

p.s. unless you set Option Compare Text or strComp as pointed out in comments, or by using Excel's comparison:

If Application.Evaluate("""" & vData1(arrayIndex1, 1) & _ 
    """ < """ & vData2(arrayIndex2, 1) & """") Then

This will solve the issue of the comparison mismatch. Indeed stopping your loop based on the < comparison should make it much faster. Whether this is the best possible algorithm is another debate. Your arrays being sorted, binary search should be a perfect candidate.

Unless you go for binary search, consider using Excel's built-in functions, i.e. Application.VLookup or Application.Match, they are almost an order of magnitude faster than VBA loops, even if the latters are working on prefetched arrays.

Upvotes: 2

Related Questions