Entelechy
Entelechy

Reputation: 45

VBA Speeding Up Index-Match for Large Dataset

I'm writing some code to reformat some datasets. At the end, I have some empty columns in my output dataset, which I need to populate by looking up values from another sheet (Index-Match effectively).

I have tried many approaches, and two were successful. The problem is, they are far too slow, and unfortunately I am limited to using VBA. To prefix, I'm aware this problem could be solved by simply writing the Index-Match formula into the cells manually at the end and dragging them down, but I am trying to make life as simple as possible for the people who will need to perform this task monthly (also to reduce room for error).

Background info:

ws_ps example Column B as lookup value, with Column E to populate if a row in Column B is not empty

ws_priips example Column G corresponds to the lookup value in ws_ps Column B, with the value to look for coming from Column E

Approach 1: Looping through each row and using WorksheetFunction - 12 min runtime

    On Error Resume Next
    last_row_ps = ws_ps.UsedRange.Rows.Count
    For ps_row = 2 To last_row_ps
        ws_ps.Cells(ps_row, 5).Value = WorksheetFunction.IfError(WorksheetFunction.Index(ws_priips.Range("A:G"), _
        WorksheetFunction.Match(ws_ps.Cells(ps_row, 2), ws_priips.Range("G:G"), 0), 5), "")
    Next ps_row
    On Error GoTo -1

Approach 2: Loading worksheets into arrays and writing to worksheet if conditions are met - 15min runtime

    last_row_ps = ws_ps.UsedRange.Rows.Count
    last_row_priips = ws_priips.UsedRange.Rows.Count
    ps_array = ws_ps.Range("A1:X" & last_row_ps).Value
    priips_array = ws_priips.Range("A1:AZ" & last_row_priips).Value
    
    For ps_row = 2 To UBound(ps_array, 1)
        For priips_row = 2 To UBound(priips_array, 1)
            If ps_array(ps_row, 2) <> "" Then
                If ps_array(ps_row, 2) = priips_array(priips_row, 7) Then
                    ws_ps.Cells(ps_row, 5).Value = priips_array(priips_row, 5)
                    GoTo SkipLoop
                End If
            Else
                GoTo SkipLoop
            End If
        Next priips_row
SkipLoop:
    Next ps_row

Am I out of luck? So far I've only implemented these solutions for 1 column, but I will need to apply it to around 10. Is there any way to drastically speed things up, without resorting to typing the functions into the worksheet or using python? I would be grateful for any pointers. I am still very much a beginner.

Upvotes: 4

Views: 183

Answers (4)

rotabor
rotabor

Reputation: 4528

Another way which can help is the concurrent loop approach (no external library, suitable for Mac):

Option Explicit

Sub Way2()
  Dim arrDest, arrSrc, ws As Worksheet, rngDest As Range, d&, s&, r&
  Application.ScreenUpdating = False
  Set ws = ThisWorkbook.ActiveSheet
  Set rngDest = ws.Range("A1").CurrentRegion
  r = rngDest.Rows.Count
  [B2].Resize(r - 1, 1) = Application.Evaluate("SEQUENCE(" & (r - 1) & ")")
  rngDest.Sort Key1:=[A:A], Header:=xlYes
  arrDest = rngDest.Value
  arrSrc = ws.Range("I1").CurrentRegion.Value
  s = 2
  For d = 2 To r
    While arrDest(d, 1) <> arrSrc(s, 1)
      s = s + 1
    Wend
    arrDest(d, 3) = arrSrc(s, 2)
    arrDest(d, 5) = arrSrc(s, 3)
  Next
  rngDest.Value = arrDest
  rngDest.Sort Key1:=[B:B], Header:=xlYes
  [B2].Resize(r - 1, 1).ClearContents
  Application.ScreenUpdating = True
End Sub

enter image description here

I tested it on the 50000 reference table and the 300000 data table. Despite additional operations, this method is only 15% slower than using a dictionary. It's expected that the reference table is ordered (which is typical), otherwise it can be temporary sorted same as the data table.

This method can be helpful if a dictionary is not applicable, e. g. if range comparison is required.

Limitation: the data table should contain only eligible keys. But it can be resolved.

To measure the execution time, use the next addition for a sub:

Public Declare PtrSafe Function GetTickCount Lib "kernel32.dll" () As Long

Sub Anysub()
  Dim t As Long
  t = GetTickCount

  ' payload code here

  Debug.Print GetTickCount - t
End Sub

The subroutine for the data table without key integrity:

Sub Way2_NoIntegrity()
  Dim arrDest, arrSrc, ws As Worksheet, d&, s&, r&, m&, p&, rngDest As Range
  Application.ScreenUpdating = False
  Set ws = ThisWorkbook.ActiveSheet
  Set rngDest = ws.Range("A1").CurrentRegion
  r = rngDest.Rows.Count
  [B2].Resize(r - 1, 1) = Application.Evaluate("SEQUENCE(" & (r - 1) & ")")
  rngDest.Sort Key1:=[A:A], Header:=xlYes
  arrDest = rngDest.Value
  arrSrc = ws.Range("I1").CurrentRegion.Value
  p = UBound(arrSrc, 1)
  s = 2: m = 2
  For d = 2 To r
    While arrDest(d, 1) <> arrSrc(s, 1)
      s = s + 1
      If s > p Then
       d = d + 1
       If d > r Then Exit For
       s = m
      End If
    Wend
    m = s
    arrDest(d, 3) = arrSrc(s, 2)
    arrDest(d, 5) = arrSrc(s, 3)
  Next
  rngDest.Value = arrDest
  rngDest.Sort Key1:=[B:B], Header:=xlYes
  [B2].Resize(r - 1, 1).ClearContents
  Application.ScreenUpdating = True
End Sub

The subroutine for range comparison:

Sub Way2RangeComp()
  Dim arrDest, arrSrc, ws As Worksheet, d&, s&, r&, m&, p&, rngDest As Range
  Application.ScreenUpdating = False
  Set ws = ThisWorkbook.ActiveSheet
  Set rngDest = ws.Range("A1").CurrentRegion
  r = rngDest.Rows.Count
  [B2].Resize(r - 1, 1) = Application.Evaluate("SEQUENCE(" & (r - 1) & ")")
  rngDest.Sort Key1:=[A:A], Header:=xlYes
  arrDest = rngDest.Value
  arrSrc = ws.[I1].CurrentRegion.Value
  p = UBound(arrSrc, 1)
  s = 2: m = 2
  For d = 2 To r
    While arrDest(d, 1) > arrSrc(s, 1)
      s = s + 1
      If s > p Then
       d = d + 1
       If d > r Then Exit For
       s = m
      End If
    Wend
    m = s
    arrDest(d, 3) = arrSrc(s, 2)
    arrDest(d, 5) = arrSrc(s, 3)
  Next
  rngDest.Value = arrDest
  rngDest.Sort Key1:=[B:B], Header:=xlYes
  [B2].Resize(r - 1, 1).ClearContents
  Application.ScreenUpdating = True
End Sub

enter image description here

More efficient (20% better then with a dictionary!) algorithm for the data table without key integrity (of course, the case with the full key integrity is included):

Sub Way2_NoIntegrity_Advanced()
  Dim arrDest, arrSrc, arrInd, ws As Worksheet, rngDest As Range
  Dim d&, s&, r&, m&, p&, ind&
  Application.ScreenUpdating = False
  Set ws = ThisWorkbook.ActiveSheet : Set rngDest = Range(ws.[E2], ws.[A2].End(xlDown))
  arrDest = rngDest.Value : r = UBound(arrDest, 1)
  arrInd = Application.SortBy(Application.Evaluate("SEQUENCE(" & r & ")"), _
    rngDest.Columns(1).Value)
  arrSrc = ws.Range("I1").CurrentRegion.Value : p = UBound(arrSrc, 1) : s = 2: m = 2
  For d = 1 To r
    ind = arrInd(d, 1)
    While arrDest(ind, 1) <> arrSrc(s, 1)
      s = s + 1
      If s > p Then
       d = d + 1: ind = arrInd(d, 1)
       If d > r Then Exit For
       s = m
      End If
    Wend
    m = s : arrDest(ind, 3) = arrSrc(s, 2) : arrDest(ind, 5) = arrSrc(s, 3)
  Next
  rngDest.Value = arrDest
  Application.ScreenUpdating = True
End Sub

Upvotes: 0

Tim Williams
Tim Williams

Reputation: 166306

Read the data from the lookup sheet into a 2D array, then build a dictionary object with the ISIN value as the key, and the array "row" number as the value. Use the dictionary to extract matched row values from the array to populate to the sheet being filled. Faster if you first pull the "to be filled" and ISIN columns from the output dataset sheet, fill them using the dictionary and the lookup array, then at the end write them back to the output sheet. Basically try to avoid any cell-by-cell operations.

Using the code below, with both tables containing 150k rows (second table sorted randomly), I filled two columns in the first table in 3-4 sec.

Note if you need cross-platform support you can use a Collection in place of the dictionary, which is not available on Mac.

Sub Tester()

    Dim arrDest, arrSrc, ws As Worksheet, isin As String
    Dim dict As Object, r As Long, rMatch As Long
    Dim rngDest As Range
    
    'both my tables are on one sheet for testing
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    'table to be filled
    Set rngDest = ws.Range("A1").CurrentRegion
    arrDest = rngDest.Value 'read to 2D array
    
    'data used as lookup table, as an array
    arrSrc = ws.Range("I1").CurrentRegion.Value

    Set dict = CreateObject("scripting.dictionary")
    'map isin to row number in lookup table array
    For r = 2 To UBound(arrSrc, 1)
        isin = arrSrc(r, 1)
        If Len(isin) > 0 Then dict(isin) = r 'assumes no duplicates in lookup table
    Next r
    
    'loop the report table and try to match values in the lookup table
    For r = 2 To UBound(arrDest, 1)
        isin = arrDest(r, 1)
        If Len(isin) > 0 Then
            If dict.Exists(isin) Then 'have a match?
                rMatch = dict(isin) 'matched row
                arrDest(r, 3) = arrSrc(rMatch, 2) 'copy a couple of values
                arrDest(r, 5) = arrSrc(rMatch, 3)
            End If 'have match
        End If     'not zero-length
    Next r
    
    rngDest.Value = arrDest 'replace data with updated array

End Sub

My test tables (shorter version):
test data

Added: for anyone who needs cross-platform support (Windows/Mac) here's the basic "row map" approach using a Collection in place of a Dictionary. Loading the collection with 150k items takes about 10% longer than loading a dictionary of the same size, but retrieval by key is actually about 5x faster using a collection. Also seems like the Collection performance scales better for larger numbers of keys - for 400k items load and read is approx. 2.2 and 0.7 sec but for dictionary it's approx. 10.5 and 10 sec (much slower)

'Using a Collection like a dictionary for mapping column values
'   to their position in a dataset
Sub CollectionTest()

    Dim arr, r As Long, col As Collection, t, k As String, v
    
    Set col = New Collection
    
    'Source range A1:A150000 filled with "Val_000001", "Val_000002", etc
    '   sorted randomly
    arr = Range("A1").CurrentRegion.Value
    
    t = Timer
    For r = 1 To UBound(arr, 1)
        k = arr(r, 1)
        'Add the row number as value and cell content as key
        '  note your row keys should be unique
        If IsEmpty(KeyValue(col, k)) Then col.Add r, k
    Next r
    Debug.Print col.Count & " items"
    Debug.Print "Loaded row map in " & Timer - t
    
    t = Timer
    For r = 1 To 150000
        k = "Val_" & Format(r, "000000")
        v = KeyValue(col, k)
        
        If r < 5 Then Debug.Print "Key " & k & " at row# " & v
    
    Next r
    Debug.Print "Retrieved values in " & Timer - t

End Sub

'Retrieve value for key `k` from collection `col`
'  Returns Empty if there's no such key
Function KeyValue(col As Collection, k As String)
    On Error Resume Next 'ignore error if no match for `k`
    KeyValue = col.Item(k)
End Function

Upvotes: 6

TinMan
TinMan

Reputation: 7759

Dictionary are the best choose for matching key values.

Read: Analyst Cave - Excel VLOOKUP vs INDEX MATCH vs SQL vs VBA

Watch: Excel VBA Introduction Part 39 - Dictionaries

Sub UpdateWSPD(ws_ps As Worksheet, ws_priips As Worksheet)
    Const priips_IDColumn As Long = 7
    Const ps_IDColumn As Long = 2
    Const priips_ValueColumn As Long = 5
    Const ps_ValueColumn As Long = 5
    
    Const KeyPrefix As String = "Key"
    
    Dim psData As Variant, priipsData As Variant
    psData = ws_ps.Range("A1").CurrentRegion.Value
    priipsData = ws_priips.Range("A1").CurrentRegion.Value
    
    Dim Map As Object
    Set Map = CreateObject("Scripting.Dictionary")
    
    Dim Key As String
    Dim r As Long
    For r = 2 To UBound(priipsData)
        Key = KeyPrefix & priipsData(r, priips_IDColumn)
        Map(Key) = r
    Next
    
    Dim priipsRow As Long
    
    For r = 2 To UBound(ws_ps)
        Key = KeyPrefix & ws_ps(r, ps_IDColumn)
        If Map.Exists(Key) Then
            priipsRow = Map(Key)
            ws_ps(r, ps_ValueColumn) = priipsData(priipsRow, priips_ValueColumn)
        End If
    Next
        
    Application.ScreenUpdating = False
    ws_ps.Range("A1").CurrentRegion.Value = ws_ps
    Application.ScreenUpdating = True

End Sub

Upvotes: 4

Michal
Michal

Reputation: 5848

Is Power Query an option? Untested so you might need to adjust a thing or two:

let
    // Load 'ws_ps' and 'ws_priips' queries
    ps_query = Excel.CurrentWorkbook(){[Name="ws_ps"]}[Content],
    priips_query = Excel.CurrentWorkbook(){[Name="ws_priips"]}[Content],

    // Merge 'ps_query' with 'priips_query' using Column B and Column G
    merged_table = Table.NestedJoin(ps_query, "Column B", priips_query, "Column G", "priips_data", JoinKind.LeftOuter),

    // Expand the 5th column (E) from the merged data
    expanded_table = Table.ExpandTableColumn(merged_table, "priips_data", {"Column E"}),

    // Replace null values with an empty string
    result_table = Table.ReplaceValue(expanded_table, null, "", Replacer.ReplaceValue, {"Column E"})
in
    result_table

Upvotes: 1

Related Questions