MSauce
MSauce

Reputation: 123

Highlight and Remove Partial Duplicates in Excel

I have a spreadsheet that contains over 100k rows in a single column (I know crazy) and I need to find an efficient way to highlight partial duplicates and remove them. All the records are all in the same format, but may have an additional letter attached at the end. I would like to keep the first instance of the partial duplicate, and remove all instances after.

So from this:

1234 W  
1234 T  
9456 S  
1234 T

To This:

1234 W  
9456 S

I was going to use the formula below to conditionally highlight the partial dupes, but i receive an error "You may not use reference operators (such as unions....) or array constants for Conditional Formatting criteria" and use VBA to remove those highlighted cells.

=if(A1<>"",Countif(A$1:A,left(A1,4)& "*") > 1)

Any thoughts? I know conditional formatting is memory intensive, so if there's any way to perform this using VBA I'm open to suggestion.

Upvotes: 0

Views: 2195

Answers (2)

cybernetic.nomad
cybernetic.nomad

Reputation: 6368

Here is one way to remove the duplicates quickly:

  1. Text to Columns, using space delimiter.

  2. Remove Duplicates referring to duplicates in the first column only.

  3. Merge the content of each row with =Concatenate(A1, B1).

Upvotes: 1

chillin
chillin

Reputation: 4486

  • If the "unique identifier" of each value is just its first 4 characters, then maybe the code below will be okay for you.
  • I recommend making a copy of your file before running any code, as code tries to overwrite the contents of column A. (The procedure to run is PreprocessAndRemoveDuplicates.)
  • You may need to change the name of the sheet (in the code). I assumed "Sheet1".
  • Code assumes data is only in column A.

Option Explicit

Private Sub PreprocessAndRemoveDuplicates()
    Dim targetSheet As Worksheet
    Set targetSheet = ThisWorkbook.Worksheets("Sheet1") ' Change to whatever yours is called. You could use code name instead too.

    Dim lastCell As Range
    Set lastCell = targetSheet.Cells(targetSheet.Rows.Count, "A").End(xlUp)
    Debug.Assert lastCell.Row > 1

    Dim inputArray() As Variant
    inputArray = targetSheet.Range("A1", lastCell) ' Assumes data starts from A1.

    Dim uniqueValues As Scripting.Dictionary
    Set uniqueValues = New Scripting.Dictionary

    Dim rowIndex As Long
    For rowIndex = LBound(inputArray, 1) To UBound(inputArray, 1)
        Dim currentKey As String
        currentKey = GetKeyFromValue(CStr(inputArray(rowIndex, 1)))

        If Not uniqueValues.Exists(currentKey) Then ' Only first instance added.
            uniqueValues.Add currentKey, inputArray(rowIndex, 1)
        End If
    Next rowIndex

    WriteDictionaryItemsToSheet uniqueValues, targetSheet.Cells(1, lastCell.Column)
End Sub

Private Function GetKeyFromValue(ByVal someText As String, Optional charactersToExtract As Long = 4) As String
    ' If below logic is not correct/appropriate for your scenario, replace with whatever it should be.
    ' Presently this just gets the first N characters of the string, where N is 4 by default.
    GetKeyFromValue = Left$(someText, charactersToExtract)
End Function

Private Sub WriteDictionaryItemsToSheet(ByVal someDictionary As Scripting.Dictionary, ByVal firstCell As Range)
    Dim initialArray() As Variant
    initialArray = someDictionary.Items()

    Dim arrayToWriteToSheet() As Variant
    arrayToWriteToSheet = StandardiseArray(initialArray)

    With firstCell
        .EntireColumn.ClearContents
        .Resize(UBound(arrayToWriteToSheet, 1), UBound(arrayToWriteToSheet, 2)).Value = arrayToWriteToSheet
    End With
End Sub

Private Function StandardiseArray(ByRef someArray() As Variant) As Variant()
    ' Application.Transpose might be limited to ~65k
    Dim baseDifference As Long
    baseDifference = 1 - LBound(someArray)

    Dim rowCount As Long ' 1 based
    rowCount = UBound(someArray) - LBound(someArray) + 1

    Dim outputArray() As Variant
    ReDim outputArray(1 To rowCount, 1 To 1)

    Dim readIndex As Long
    Dim writeIndex As Long
    For readIndex = LBound(someArray) To UBound(someArray)
        writeIndex = writeIndex + 1
        outputArray(writeIndex, 1) = someArray(readIndex)
    Next readIndex

    StandardiseArray = outputArray
End Function

Processed 1 million values (A1:A1000000) in under 3 seconds on my machine, but performance on your machine may differ.

Upvotes: 0

Related Questions