Reputation: 123
I have a spreadsheet that contains over 100k rows in a single column (I know crazy) and I need to find an efficient way to highlight partial duplicates and remove them. All the records are all in the same format, but may have an additional letter attached at the end. I would like to keep the first instance of the partial duplicate, and remove all instances after.
So from this:
1234 W
1234 T
9456 S
1234 T
To This:
1234 W
9456 S
I was going to use the formula below to conditionally highlight the partial dupes, but i receive an error "You may not use reference operators (such as unions....) or array constants for Conditional Formatting criteria" and use VBA to remove those highlighted cells.
=if(A1<>"",Countif(A$1:A,left(A1,4)& "*") > 1)
Any thoughts? I know conditional formatting is memory intensive, so if there's any way to perform this using VBA I'm open to suggestion.
Upvotes: 0
Views: 2195
Reputation: 6368
Here is one way to remove the duplicates quickly:
Text to Columns
, using space delimiter.
Remove Duplicates
referring to duplicates in the first column only.
Merge the content of each row with =Concatenate(A1, B1)
.
Upvotes: 1
Reputation: 4486
PreprocessAndRemoveDuplicates
.)"Sheet1"
.Option Explicit
Private Sub PreprocessAndRemoveDuplicates()
Dim targetSheet As Worksheet
Set targetSheet = ThisWorkbook.Worksheets("Sheet1") ' Change to whatever yours is called. You could use code name instead too.
Dim lastCell As Range
Set lastCell = targetSheet.Cells(targetSheet.Rows.Count, "A").End(xlUp)
Debug.Assert lastCell.Row > 1
Dim inputArray() As Variant
inputArray = targetSheet.Range("A1", lastCell) ' Assumes data starts from A1.
Dim uniqueValues As Scripting.Dictionary
Set uniqueValues = New Scripting.Dictionary
Dim rowIndex As Long
For rowIndex = LBound(inputArray, 1) To UBound(inputArray, 1)
Dim currentKey As String
currentKey = GetKeyFromValue(CStr(inputArray(rowIndex, 1)))
If Not uniqueValues.Exists(currentKey) Then ' Only first instance added.
uniqueValues.Add currentKey, inputArray(rowIndex, 1)
End If
Next rowIndex
WriteDictionaryItemsToSheet uniqueValues, targetSheet.Cells(1, lastCell.Column)
End Sub
Private Function GetKeyFromValue(ByVal someText As String, Optional charactersToExtract As Long = 4) As String
' If below logic is not correct/appropriate for your scenario, replace with whatever it should be.
' Presently this just gets the first N characters of the string, where N is 4 by default.
GetKeyFromValue = Left$(someText, charactersToExtract)
End Function
Private Sub WriteDictionaryItemsToSheet(ByVal someDictionary As Scripting.Dictionary, ByVal firstCell As Range)
Dim initialArray() As Variant
initialArray = someDictionary.Items()
Dim arrayToWriteToSheet() As Variant
arrayToWriteToSheet = StandardiseArray(initialArray)
With firstCell
.EntireColumn.ClearContents
.Resize(UBound(arrayToWriteToSheet, 1), UBound(arrayToWriteToSheet, 2)).Value = arrayToWriteToSheet
End With
End Sub
Private Function StandardiseArray(ByRef someArray() As Variant) As Variant()
' Application.Transpose might be limited to ~65k
Dim baseDifference As Long
baseDifference = 1 - LBound(someArray)
Dim rowCount As Long ' 1 based
rowCount = UBound(someArray) - LBound(someArray) + 1
Dim outputArray() As Variant
ReDim outputArray(1 To rowCount, 1 To 1)
Dim readIndex As Long
Dim writeIndex As Long
For readIndex = LBound(someArray) To UBound(someArray)
writeIndex = writeIndex + 1
outputArray(writeIndex, 1) = someArray(readIndex)
Next readIndex
StandardiseArray = outputArray
End Function
Processed 1 million values (A1:A1000000
) in under 3 seconds on my machine, but performance on your machine may differ.
Upvotes: 0