CaffeinatedMike
CaffeinatedMike

Reputation: 1607

Finding (NOT deleting) duplicate values(rows) in multi-dimensional array using Excel VBA

Building off of one of my past questions
What I'm looking to accomplish:

I'm looking to find and highlight duplicate Upcharges using VBA code based on multiple criteria:

  1. Product's XID (Column A)
  2. Upcharge Criteria 1 (Column CT)
  3. Upcharge Criteria 2 (Column CU)
  4. Upcharge Type (Column CV) and
  5. Upcharge Level (Column CW)

If there is more than one instance/row in a spreadsheet that share/match ALL of these criteria then that means the Upcharge is a duplicate. As seen in my previous post linked above:

What I've tried:

  1. Created a general formula (see below) that is inserted into a Helper column and copied all the way down the spreadsheet which points out which Upcharges are duplicate. This method was too resource heavy and took too long (8-10 minutes for all the formulas to calculate, but doesn't lag when filtering). Then I tried
  2. Evolved the general formula into a Conditional Formatting Formula and applied it to the Upcharge Name column via VBA code.(Takes same amount of time AND lags when filtering)
  3. I've also looked into possibly using a scripting.dictionary, but I'm not sure how (or if) that would work with a multi-dimensional array.

Now I've finally found the method I think will be much faster,

The faster method I'm looking to use: Dumping the aforementioned columns into a multi-dimensional array, finding the duplicate "rows" in the array, then highlighting the corresponding spreadsheet rows.

My attempt at the faster method: Here's how I populate the multi-dimensional array

Sub populateArray()
    Dim arrXID() As Variant, arrUpchargeOne() As Variant, arrUpchargeTwo() As Variant, arrUpchargeType() As Variant, arrUpchargeLevel() As Variant
    Dim arrAllData() As Variant
    Dim i As Long, lrow As Long
    lrow = ActiveSheet.Cells(Rows.Count, 1).End(xlUp).Row

    arrXID = Range("A2:A" & lrow) 'amend column number
    arrUpchargeOne = Range("CT2:CT" & lrow)
    arrUpchargeTwo = Range("CU2:CU" & lrow)
    arrUpchargeType = Range("CV2:CV" & lrow)
    arrUpchargeLevel = Range("CW2:CW" & lrow)

    ReDim arrAllData(1 To UBound(arrXID, 1), 4) As Variant
        For i = 1 To UBound(arrXID, 1)
            arrAllData(i, 0) = arrXID(i, 1)
            arrAllData(i, 1) = arrUpchargeOne(i, 1)
            arrAllData(i, 2) = arrUpchargeTwo(i, 1)
            arrAllData(i, 3) = arrUpchargeType(i, 1)
            arrAllData(i, 4) = arrUpchargeLevel(i, 1)
        Next i
End Sub

I can get the columns into the array, but I get stuck from there. I'm not sure how to go about checking for the duplicate "rows" in the array.

My questions:

  1. Is there a way I can apply my formula (see below) from my first attempt in my previous post and apply it inside the array?:
  2. Or, even better, is there a faster way I can find the duplicate "rows" inside the array?
  3. Then how could I go about highlighting the Upcharge Name (CS) cell in the spreadsheet rows that correspond with the "rows" in the array that were flagged as duplicates?

Formula from my previous post for reference:

=AND(SUMPRODUCT(($A$2:$A$" & lastRow & "=$A2)*($CT$2:$CT$" & lastRow & "=$CT2)*($CU$2:$CU$" & lastRow & "=$CU2)*($CV$2:$CV$" & lastRow & "=$CV2)*($CW$2:$CW$" & lastRow & "=$CW2))>1,$CT2 <> """")"
Returns TRUE if Upcharge is a duplicate 

Upvotes: 3

Views: 1487

Answers (5)

user4039065
user4039065

Reputation:

Conditional Formatting and Filtering

SUMPRODUCT vs COUNTIFS

First off, your choice of functions was inappropriate for such a large number of rows coupled with several conditions. A COUNTIFS function can perform many of the same multiple criteria operations that a SUMPRODUCT function can but in typically 25-35% of the calculation load and time. Additionally, full column references can be used without detriment in COUNTIFS as the column references are internally truncated at the limits of the Worksheet.UsedRange property.

Your standard formula can be written with COUNTIFS as,

=AND(COUNTIFS(A:A, A2, CT:CT, CT2, CU:CU, CU2, CV:CV, CV2, CW:CW, CW2)>1, CT2<>"")
'... or,
=COUNTIFS(A:A, A2, CT:CT, CT2, CT:CT, "<>", CU:CU, CU2, CV:CV, CV2, CW:CW, CW2)>1

Bringing the non-blank column CT condition directly into the COUNTIFS function actually improved calculation time slightly.

Only Calculate When You Have To

The original formula can be broken down to two main conditions.

  1. Is the cell in column CT non-blank?
  2. Do the values in five columns match the same five columns any other row?

A rudimentary IF function halts processing if the condition is not true. If the test for a non-blank cell in column CT is moved into a wrapping IF then the COUNTIFS (the bulk of the calculation) will only be processed if there is a value in the current row's CT column.

The improved standard formula becomes,

=IF(CT2<>"", COUNTIFS(A:A, A2, CT:CT, CT2, CU:CU, CU2, CV:CV, CV2, CW:CW, CW2)>1)

The benefits for this modification depend upon the number of blank cells in column CT. If only 1% of the 15,000 cells are blank, very little improvement will be noticed. However, if 50% of the cells in column CT are typically blank there will be a substantial improvement as you are literally knocking your calculation cycles in half.

Sorting the Data to Limit the Ranges

By far, the biggest calculation parasite is with the COUNTIFS looking through 15,000 rows of data in five separate columns. If the data was sorted on one or more of the criteria columns then it becomes unnecessary to look through all 15,000 rows for matches to all five columns of criteria.

For the purpose of this exercise, it will be assumed that column A is sorted in an ascending manner. If you want to test the hypothesis discussed here, sort the data now.

The INDEX function does more than return a value; it actually returns a valid cell address. When used in its most common lookup capacity, you see the value returned but in reality, unlike a similar VLOOKUP operation which only return the cell's value, INDEX is returning the actual cell; e.g. =A1, not the 99 that A1 contains. This hyper-functionality can be used to create valid ranges that can be used in other functions. e.g. A2:A9 can also be written as INDEX(A:A, 2):INDEX(A:A, 9).

This functionality cannot be used directly within a Conditional Formatting rule. However, it can be used in a Named Range and a Named Range can be used in a Conditional Formatting rule.

tl;dr

Sub lminyCFrule()

    Debug.Print Timer
    'Application.ScreenUpdating = False '<~~ uncomment this once you are no longer debugging
    On Error Resume Next    '<~~ needed for deleting objects without checking to see if they exist

    With Worksheets("Upcharge") '<~~ you know what worksheet you are supposed to be on
        If .AutoFilterMode Then .AutoFilterMode = False

        'delete any existing defined name called 'localXID' or 'local200'
        With .Parent
            .Names("localXID").Delete
            .Names("local200").Delete
        End With

        'create a new defined name called 'localXID' for CF rule method 1
        .Names.Add Name:="localXID", RefersToR1C1:= _
            "=INDEX('" & .Name & "'!C1:C104, MATCH('" & .Name & "'!RC1, '" & .Name & "'!C1, 0), 0):" & _
             "INDEX('" & .Name & "'!C1:C104, MATCH('" & .Name & "'!RC1, '" & .Name & "'!C1 ), 0)"
        'create a new defined name called 'local200' for CF rule method 2
        .Names.Add Name:="local200", RefersToR1C1:= _
            "=INDEX(Upcharge!C1:C104, MAX(2, ROW()-100), 0):INDEX(Upcharge!C1:C101, ROW()+100, 0)"

        With .Cells(1, 1).CurrentRegion
            'sort on column A in ascending order
             .Cells.Sort Key1:=.Columns(1), Order1:=xlAscending, _
                        Orientation:=xlTopToBottom, Header:=xlYes

            'create a CF rule on column CS
            With .Resize(.Rows.Count - 1, 1).Offset(1, 96)
                With .FormatConditions
                    .Delete
                    ' method 1 and method 2. Only use ONE of these!
                    ' method 1 - definitively start and end of XIDs in column A (slower, no mistakes)
                    '.Add Type:=xlExpression, Formula1:= _
                        "=IF(CT2<>"""", COUNTIFS(INDEX(localXID, 0, 1), A2, INDEX(localXID, 0, 98), CT2," & _
                                                "INDEX(localXID, 0, 99), CU2, INDEX(localXID, 0, 100), CV2," & _
                                                "INDEX(localXID, 0, 101), CW2)-1)"
                    ' method 2 - best guess at start and end of XIDs in column A (faster, guesswork at true scope)
                    .Add Type:=xlExpression, Formula1:= _
                        "=IF(CT2<>"""", COUNTIFS(INDEX(local200, 0, 1), A2, INDEX(local200, 0, 98), CT2," & _
                                                "INDEX(local200, 0, 99), CU2, INDEX(local200, 0, 100), CV2," & _
                                                "INDEX(local200, 0, 101), CW2)-1)"
                End With
                .FormatConditions(.FormatConditions.Count).Interior.ColorIndex = 3
            End With

            'Filter based on column CS is red
            .Columns(97).AutoFilter Field:=1, Criteria1:=vbRed, Operator:=xlFilterCellColor
        End With
    End With

    Application.ScreenUpdating = True
    Debug.Print Timer

End Sub

While not screaming fast, this does the job handily. The 'best guess' is faster than the 'definitive start and finish' but you run the risk of not completely covering the scope of the duplicates in column A. Of course, the offsets (e.g. 100 up and down) that control the scope could be adjusted.

Upvotes: 1

G&#252;rkan &#199;etin
G&#252;rkan &#199;etin

Reputation: 121

This might work like a magic trick, but not sure if it would work.

Could you just create another supportive (temporary) column, concatenating all four criteria?

ZZ_Temp = concatenate (CS; CV; CZ; etc)

This way, I suppose, you could show/highlight duplicates a lot faster.

Upvotes: 0

user4039065
user4039065

Reputation:

You say identify duplicates; I hear Scripting.Dictionary object.

Public Sub lminyDupes()
    Dim d As Long, str As String, vAs As Variant, vCTCWs As Variant
    Dim dDUPEs As Object                      '<~~ Late Binding
    'Dim dDUPEs As New Scripting.Dictionary   '<~~ Early Binding

    Debug.Print Timer
    Application.ScreenUpdating = False '<~~ uncomment this once you are no longer debugging

    'Remove the next line with Early Binding¹
    Set dDUPEs = CreateObject("Scripting.Dictionary")
    dDUPEs.comparemode = vbTextCompare

    With Worksheets("Upcharge") '<~~ you know what worksheet you are supposed to be on
        With .Cells(1, 1).CurrentRegion
            With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0)
                .Columns(97).Interior.Pattern = xlNone  '<~~ reset column CS

                'the following is intended to mimic a CF rule using this formula
                '=AND(COUNTIFS(A:A, A2, CT:CT, CT2, CU:CU, CU2, CV:CV, CV2, CW:CW, CW2)>1, SIGN(LEN(CT2)))

                vAs = .Columns(1).Value2
                vCTCWs = Union(.Columns(98), .Columns(99), .Columns(100), .Columns(101)).Value2

                For d = LBound(vAs, 1) To UBound(vAs, 1)
                    If CBool(Len(vCTCWs(d, 1))) Then
                        'make a key of the criteria values
                        str = Join(Array(vAs(d, 1), vCTCWs(d, 1), vCTCWs(d, 2), vCTCWs(d, 3), vCTCWs(d, 4)), ChrW(8203))
                        If dDUPEs.exists(str) Then
                            'the comboned key exists in the dictionary; append the current row
                            dDUPEs.Item(str) = dDUPEs.Item(str) & Chr(44) & "CS" & d
                        Else
                            'the combined key does not exist in the dictionary; store the current row
                            dDUPEs.Add Key:=str, Item:="CS" & d
                        End If
                    End If
                Next d

                'reuse a variant var to provide row highlighting
                Erase vAs
                For Each vAs In dDUPEs.keys
                    'if there is more than a single cell address, highlight all
                    If CBool(InStr(1, dDUPEs.Item(vAs), Chr(44))) Then _
                        .Range(dDUPEs.Item(vAs)).Interior.Color = vbRed
                Next vAs
            End With
        End With

    End With

    dDUPEs.RemoveAll: Set dDUPEs = Nothing
    Erase vCTCWs

    Application.ScreenUpdating = True
    Debug.Print Timer

End Sub

This seems faster than the formula approach.


¹ If you plan to convert the late binding of the Scripting.Dictionary object to early binding, you must add Microsoft Scripting Runtime to the VBE's Tools ► References.

Upvotes: 4

Parfait
Parfait

Reputation: 107767

Consider an SQL solution as this is a typical aggregate group by query where you filter for counts greater than 1. To go about your route requires many conditional logic within the loop across all elements of array.

While I recommend you simply import your data into a database like Excel's sibling MS Access, Excel can run SQL statements on its own workbook using an ADO connection (not to get into particulars but both Excel and Access uses the same Jet/ACE engine). And one good thing is you seem to be set up to run such a query with the table like structure of named columns.

The below example references your fields in a worksheet called Data (Data$) and query outputs to a worksheet called Results (with headers). Change names as needed. Two connection strings are included (one of which is commented out). Hopefully it runs on your end!

Sub RunSQL()

    Dim conn As Object, rst As Object
    Dim i As Integer, fld As Object
    Dim strConnection As String, strSQL As String

    Set conn = CreateObject("ADODB.Connection")
    Set rst = CreateObject("ADODB.Recordset")

    ' Connection and SQL Strings
'    strConnection = "DRIVER={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};" _
'                      & "DBQ=C:\Path\To\Workbook.xlsm;"
    strConnection = "Provider=Microsoft.ACE.OLEDB.12.0;" _
                       & "Data Source='C:\Path\To\Workbook.xlsm';" _
                       & "Extended Properties=""Excel 8.0;HDR=YES;"";"

    strSQL = " SELECT [Data$].[Product's XID], [Data$].[Upcharge Criteria 1]," _
                & " [Data$].[Upcharge Criteria 2], [Data$].[Upcharge Type]," _
                & " [Data$].[Upcharge Type], [Data$].[Upcharge Level]" _ 
                & " FROM [Data$]" _
                & " GROUP BY [Data$].[Product's XID], [Data$].[Upcharge Criteria 1]," _
                & " [Data$].[Upcharge Criteria 2], [Data$].[Upcharge Type]," _
                & " [Data$].[Upcharge Type], [Data$].[Upcharge Level]," _
                & " [Data$].[Product's XID]" _
                & " HAVING COUNT(*) > 1;"

    ' Open the db connection
    conn.Open strConnection
    rst.Open strSQL, conn

    ' Column headers
    i = 0
    Worksheets("Results").Range("A1").Activate
    For Each fld In rst.Fields
        ActiveCell.Offset(0, i) = fld.Name
        i = i + 1
    Next fld

    ' Data rows        
    Worksheets("Results").Range("A2").CopyFromRecordset rst

    rst.Close
    conn.Close

End Sub

Upvotes: 0

Sixthsense
Sixthsense

Reputation: 1975

Why don't you remove the Indirect() and replace the Countif() function with some stable Row reference. Since Indirect() part is a volatile and instead of using Indirect() you can straight away use some stable row reference like $A$2:$A$50000 which may show some significant change in performance.

Or

Use Create Table for your data. Use Table reference in your formula which will work faster than Indirect() reference.

Edit

Your actual formula

=AND(SUMPRODUCT(($A$2:$A$500=$A2)*($CU$2:$CU$500=$CU2)*($CV$2:$CV$500=$CV2)*($CW$2:$CW$500=$CW2)*($CX$2:$CX$500=$CX2))>1,$CU2 <> "")

Why don't you convert it to Counti(S) with stable reference like the below?

=AND(COUNTIFS($A$2:$A$500,$A2,$CU$2:$CU$500,$CU2,$CV$2:$CV$500,$CV2,$CW$2:$CW**$500,$CW2,$CX$2:$CX$500,$CX2)>1,$CU12<>"")

Upvotes: 0

Related Questions