user4039065
user4039065

Reputation:

Replace worksheet array formula with VBA memory array

My worksheet requires the following array formula in BG2.

=INDEX('Client'!O$2:O$347473,
       MATCH(1, (('Client_Cost'!D$2:D$347473='Client'!BC2)*
                 ('Client_Cost'!E$2:E$347473='Client'!BE2)), 0))

This provides a two-column match (Client_Cost!D:D to Client!BC2 AND Client_Cost!E:E to Client!BE2) and returns the corresponding value from Client!O:O.

The large number of rows makes the array formula very calculation-intensive. I can deal with a few hundred rows (~90 seconds for 500 rows) but I need results all the way down to Client'!BG347473 and I would like them sometime this year.

I've tried using Application Evaluate to return a result from the array formula into an variant array and subsequently returning the array of results to the worksheet en masse but it isn't the improvement I was hoping for. Looking for alternatives.

Upvotes: 4

Views: 649

Answers (1)

user4039065
user4039065

Reputation:

First off, I would recommend developing alternative methods with a smaller data set. 5K or 10K rows will either demonstrate a noticeable improvement or not; you can always expand to the original data set once you are confident you are not going to get into a long 'Not responding' state that you have to crash out of.

A common method of removing the array from that style of array formula¹ is a 'helper' column that concatenates the two values from column D and E in the Client_Cost worksheet into a single, delimited value. For example, in Client_Cost!Z2 as,

=CONCATENATE(Client_Cost!D2, "|", Client_Cost!E2)

Filled down to Client_Cost!Z347473 should only take a second or two.

Once that is set up, a single INDEX/MATCH function pair can provide a vastly more efficient lookup on a similarly concatenated Client!BC2 and Client'!BE2. In Client!BG2 as,

=INDEX(Client!O$2:O$347473,
   MATCH(CONCATENATE(Client!BC2, "|", Client!BE2), 
         Client_Cost'!Z$2:Z$347473, 0))

That will take 1 hr, 51 minutes for 350K rows. While not yet optimal, it is a big improvement over the estimated ~17.5 hours that the original took.

The next logical step in optimizing that method would be working with a VBA Scripting.Dictionary object. A dictionary holds its own unique index on its keys and the concatenated values could be stuffed into a dictionary object to facilitate virtually instantaneous lookups on a large number of items (i.e. rows).

Sub JR_CSE_in_Array()
    Dim olr As Long, rws As Long, JR_Count As Long, JR_Values As Variant
    Dim v As Long, vTMP As Variant, vTMPs As Variant, dVALs As Object

    Debug.Print Timer
    Set dVALs = CreateObject("Scripting.Dictionary")

    'get some dimensions to the various data ranges
    With Worksheets("Client_Cost")
        'only use as many rows as absolutely necessary
        olr = Application.Min(.Cells(Rows.Count, "D").End(xlUp).Row, _
                              .Cells(Rows.Count, "E").End(xlUp).Row)
        'store D & E
        vTMPs = .Range(.Cells(2, 4), .Cells(olr, 5)).Value2

    End With
    With Worksheets("Client")
        rws = Application.Min(.Cells(Rows.Count, "BC").End(xlUp).Row, _
                              .Cells(Rows.Count, "BE").End(xlUp).Row, _
                              UBound(vTMPs, 1))
        'override the above statement for sampling
        'rws = 5000

        'building the Dictionary object takes a fair bit of time but it is worth it
        vTMP = .Range(.Cells(2, 15), .Cells(olr, 15)).Value2
        For v = LBound(vTMPs, 1) To UBound(vTMPs, 1)
            If Not dVALs.Exists(Join(Array(vTMPs(v, 1), vTMPs(v, 2)), ChrW(8203))) Then _
                dVALs.Add Key:=Join(Array(vTMPs(v, 1), vTMPs(v, 2)), ChrW(8203)), Item:=vTMP(v, 1)
        Next v

        'store BC and BE
        vTMPs = .Range(.Cells(2, 55), .Cells(olr, 57)).Value2
    End With

    ReDim JR_Values(1 To rws, 1 To 1)   'force a two-dimension, one-based index on the array
    'Debug.Print LBound(JR_Values) & ":" & UBound(JR_Values)

    For JR_Count = LBound(JR_Values, 1) To UBound(JR_Values, 1) Step 1
        If dVALs.Exists(Join(Array(vTMPs(JR_Count, 1), vTMPs(JR_Count, 3)), ChrW(8203))) Then
            JR_Values(JR_Count, 1) = dVALs.Item(Join(Array(vTMPs(JR_Count, 1), vTMPs(JR_Count, 3)), ChrW(8203)))
        End If
    Next JR_Count

    With Worksheets("Client")
        .Range("BG2").Resize(UBound(JR_Values), 1) = JR_Values
    End With

    'Debug.Print dVALs.Count
    dVALs.RemoveAll: Set dVALs = Nothing
    Debug.Print Timer
End Sub

The elapsed time for that routine to run (without helper column(s)) was 45.72 seconds. Breaking it down, it took a full 13.4 seconds just to build the dictionary and the remainder was largely taken up by the actual lookup with a half-second here and there attributed to the bulk seeding of the variant arrays from the worksheets' values.

         Multi_Col_Match_Array_in_Memory

So the Scripting.Dictionary is the clear winner here. Unfortunately, it is not automatically calculating updates in the various columns when values change but at this stage of development, the worksheet should be set to manual calculation. Setting one of the formula-based solutions into a recalculation event from a single retyped value seems an inefficient expenditure of time.

All-in-all, this makes perfect sense. The original array formula is analogous to an SQL SELECT statement with an INNER JOIN on two fields and if my SELECT statement was running inefficiently the first thing I would do to improve it would be to look at the tables' indexes.

On a related note, any workbook with this much data should be saved as a Excel Binary Workbook regardless of whether it is macro-enabled or not. The file size of a binary workbook (.XLSB) is typically ¹⁄₃ the size of an equivalent .XLSX or .XLSM. Beyond a faster initial load time, many bulk operations should prove faster.

Anyone wishing to test their own optimizations can find my sample .XLSB workbook here for the time being. Don't blindly run the procedures without seeing what you're getting into first.


¹ Array formulas need to be finalized with Ctrl+Shift+Enter↵. Once entered into the first cell correctly, they can be filled or copied down or right just like any other formula. Try and reduce your full-column references to ranges more closely representing the extents of your actual data. Array formulas chew up calculation cycles logarithmically so it is good practise to narrow the referenced ranges to a minimum. See Guidelines and examples of array formulas for more information.

Upvotes: 3

Related Questions