Reputation: 349
My spreadsheet has about 800,000 rows with 30 columns. Customers are interested in duplicate values in only one column, but they need the entire row back. For example:
MemberId|Name|Address|CircleScore
H111|John Doe|123 W Main|2.4
H222|Jane Doe|124 W Main|3.2
H333|Bob Doe|125 W Main|2.5
H444|Jake Doe|126 W Main|2.1
H555|Mike Doe|127 W Main|2.4
They want the entire rows where there are duplicates in CircleScore, so my filtered Excel file should contain only:
MemberId|Name|Address|CircleScore
H111|John Doe|123 W Main|2.4
H555|Mike Doe|127 W Main|2.4
I tried highlighting duplicate CircleScore values and filtering, but the filtering part takes forever. I have waited 15 minutes with still no luck. There could be around 150K duplicates.
Is there an alternative?
Upvotes: 4
Views: 1477
Reputation: 1
Try this VBA code (and learn a little bit of Dutch from the variable names)
Sub DuplicatesInColumn()
    'builds a list of the duplicate rows it finds, two columns to the right of the data
    Dim LaatsteRij As Long   'last row
    Dim MatchNr As Long
    Dim iRij As Long, iKolom As Long, Teller As Long, ControlKolom As Long
    iKolom = 5               'number of columns in the sheet; change if not correct
    ControlKolom = 4         'column number where to find the duplicates; change if not correct
    Sheet1.Activate
    LaatsteRij = Cells(Rows.Count, iKolom).End(xlUp).Row
    For iRij = 1 To LaatsteRij
        If Cells(iRij, ControlKolom) <> "" Then
            'MATCH returns the first row holding this value; any later hit is a duplicate,
            'so this copies the second and later occurrences (not the first one)
            MatchNr = WorksheetFunction.Match(Cells(iRij, ControlKolom), _
                Range(Cells(1, ControlKolom), Cells(LaatsteRij, ControlKolom)), 0)
            If iRij <> MatchNr Then
                For Teller = 1 To iKolom
                    Cells(iRij, iKolom + Teller).Offset(0, 2).Value = Cells(iRij, Teller).Value
                Next Teller
            End If
        End If
    Next iRij
End Sub
Upvotes: 0
Reputation:
Please disregard this submission if you are a) getting paid by the hour and feel underpaid, b) planning on a nap while the routine processes, or c) both a) and b).
With any data set approaching 800K rows (with 30 columns), you are going to want to step into the variant array arena. Processing in memory typically takes 5-7% of the time it takes to work with the worksheet values directly, which makes it very appropriate for large data blocks.
Any time the word 'duplicates' comes into play, I immediately start thinking about how the unique index a Scripting.Dictionary object keeps on its Keys can help. In this solution I used a pair of dictionaries to identify the rows of data with a repeated CircleScore value.
Twenty-four million cells of data is a lot to store and transfer. Bulk methods beat individual methods every time, and the bulkiest method of peeling off the data is to stuff all 800K rows × 30 columns into a variant array. All processing becomes in-memory and the results are returned to a report worksheet en masse.
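Before the full routine, here is a stripped-down sketch of just the two-dictionary bookkeeping. The procedure name, dictionary names and layout are mine for illustration: it assumes the active sheet holds the scores in column D starting at row 2, and it reads cells one at a time purely for clarity. The real routine below does the same bookkeeping against the in-memory variant array instead.
Sub twoDictionarySketch()
    'minimal illustration only - dFirst remembers the first row for each score,
    'dDups collects every row whose score has been seen before (plus that first row)
    Dim dFirst As Object, dDups As Object, r As Long
    Set dFirst = CreateObject("Scripting.Dictionary")
    Set dDups = CreateObject("Scripting.Dictionary")
    For r = 2 To Cells(Rows.Count, "D").End(xlUp).Row
        If dFirst.exists(Cells(r, "D").Value2) Then
            dDups(r) = True                             'repeat of an earlier score
            dDups(dFirst(Cells(r, "D").Value2)) = True  'flag the first occurrence too
        Else
            dFirst(Cells(r, "D").Value2) = r            'first row seen for this score
        End If
    Next r
    Debug.Print dDups.Count & " rows carry a repeated score"
End Sub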
isolateDuplicateCircleScores code
Sub isolateDuplicateCircleScores()
    Dim d As Long, v As Long, csc As Long, stmp As String
    Dim ky As Variant, itm As Variant, vVALs As Variant, dCSs As Object, dDUPs As Object
    Dim w As Long, vWSs As Variant

    'early binding
    'dim dCSs As new scripting.dictionary, dDUPs As new scripting.dictionary

    appTGGL bTGGL:=False

    'late binding - not necessary with Early Binding (see footnote ¹)
    Set dCSs = CreateObject("Scripting.Dictionary")
    Set dDUPs = CreateObject("Scripting.Dictionary")
    'set to the defaults (not necessary)
    dCSs.CompareMode = vbBinaryCompare
    dDUPs.CompareMode = vbBinaryCompare

    'for testing on multiple row number scenarios
    'vWSs = Array("CircleScores_8K", "CircleScores_80K", "CircleScores_800K")
    'for runtime
    vWSs = Array("CircleScores") '<~~ your source worksheet here

    For w = LBound(vWSs) To UBound(vWSs)
        'ThisWorkbook.Save
        Debug.Print vWSs(w)
        Debug.Print Timer
        With Worksheets(vWSs(w))
            On Error Resume Next
            Worksheets(vWSs(w) & "_dupes").Delete
            On Error GoTo 0
            ReDim vVALs(0)
            dCSs.RemoveAll
            dDUPs.RemoveAll

            'prep a new worksheet to receive the duplicates
            .Cells(1, 1).CurrentRegion.Resize(2).Copy
            With Worksheets.Add(after:=Worksheets(.Index))
                .Name = vWSs(w) & "_dupes"
                With .Cells(1, 1)
                    .PasteSpecial Paste:=xlPasteAllUsingSourceTheme, Operation:=xlNone
                    .PasteSpecial Paste:=xlPasteColumnWidths, Operation:=xlNone
                    .Value = .Value2
                    .Offset(1, 0).EntireRow.ClearContents
                End With
            End With

            'finish prep with freeze row 1 and zoom to 80%
            With Application.Windows(1)
                .SplitColumn = 0
                .SplitRow = 1
                .FreezePanes = True
                .Zoom = 80
            End With

            'grab all of the data into a variant array
            ReDim vVALs(0)
            csc = Application.Match("CircleScore", .Rows(1), 0) 'CircleScore column number needed later
            vVALs = .Range(.Cells(2, 1), _
                           .Cells(.Cells(.Rows.Count, csc).End(xlUp).Row, _
                           .Cells(1, .Columns.Count).End(xlToLeft).Column)).Value2
            'Debug.Print LBound(vVALs, 1) & ":" & UBound(vVALs, 1)   '1:~800K
            'Debug.Print LBound(vVALs, 2) & ":" & UBound(vVALs, 2)   '1:~30
        End With 'done with the original worksheet

        'populate the dDUPs dictionary using the key index in dCSs
        For v = LBound(vVALs, 1) To UBound(vVALs, 1)
            If dCSs.exists(vVALs(v, csc)) Then
                stmp = vVALs(v, 1)
                For d = LBound(vVALs, 2) + 1 To UBound(vVALs, 2)
                    stmp = Join(Array(stmp, vVALs(v, d)), ChrW(8203))
                Next d
                dDUPs.Add Key:=v, Item:=stmp
                If Not dDUPs.exists(dCSs.Item(vVALs(v, csc))) Then
                    stmp = vVALs(dCSs.Item(vVALs(v, csc)), 1)
                    For d = LBound(vVALs, 2) + 1 To UBound(vVALs, 2)
                        stmp = Join(Array(stmp, vVALs(dCSs.Item(vVALs(v, csc)), d)), ChrW(8203))
                    Next d
                    dDUPs.Add Key:=dCSs.Item(vVALs(v, csc)), Item:=stmp
                End If
            Else
                dCSs.Item(vVALs(v, csc)) = v
            End If
        Next v

        'split the dDUPs dictionary items back into a variant array
        d = 1
        ReDim vVALs(1 To dDUPs.Count, 1 To UBound(vVALs, 2))
        For Each ky In dDUPs.keys
            itm = Split(dDUPs.Item(ky), ChrW(8203))
            For v = LBound(itm) To UBound(itm)
                vVALs(d, v + 1) = itm(v)
            Next v
            d = d + 1
        Next ky

        'put the values into the duplicates worksheet
        With Worksheets(vWSs(w) & "_dupes")
            .Cells(2, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
            With .Cells(1, 1).CurrentRegion
                With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0)
                    .Rows(1).Copy
                    .PasteSpecial Paste:=xlPasteFormats, Operation:=xlNone
                End With
                .Cells.Sort Key1:=.Columns(csc), Order1:=xlAscending, _
                            Key2:=.Columns(1), Order2:=xlAscending, _
                            Orientation:=xlTopToBottom, Header:=xlYes
            End With
        End With
        Debug.Print Timer
    Next w

    dCSs.RemoveAll: Set dCSs = Nothing
    dDUPs.RemoveAll: Set dDUPs = Nothing

    appTGGL
End Sub
Public Sub appTGGL(Optional bTGGL As Boolean = True)
    With Application
        .ScreenUpdating = bTGGL
        .EnableEvents = bTGGL
        .DisplayAlerts = bTGGL
        .AutoRecover.Enabled = bTGGL 'no interruptions with an auto-save
        .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
        .CutCopyMode = False
        .StatusBar = vbNullString
    End With
    Debug.Print Timer
End Sub
800K rows × 30 columns of random sample data
~123K rows × 30 columns of duplicate rows (sorted and formatted in about a minute-and-a-half)
To be honest, I never got the 32-bit version of Excel on the older laptop to run the 800K pass more than once without restarting Excel. Once restarted, the results were consistent with what is shown. The 64-bit Excel ran repeatedly without a hiccup.
Large Worksheet Addendum
When dealing with worksheets containing large data blocks, there are a few general improvements that can limit your wait times. You're using Excel as a medium-sized database tool, so treat the data worksheet as the raw data that it should be. A volatile function² like TODAY() in a COUNTIF formula filled down for the extent of the rows will have you sitting on your thumb more often than not.
¹ If you convert the late binding of the Scripting.Dictionary to early binding, you must add Microsoft Scripting Runtime to the VBE's Tools ► References.
² Volatile functions recalculate whenever anything in the entire workbook changes, not just when something that affects their outcome changes. Examples of volatile functions are INDIRECT, OFFSET, TODAY, NOW, RAND and RANDBETWEEN. Some sub-functions of the CELL and INFO worksheet functions will make them volatile as well.
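To illustrate footnote ² with a hypothetical layout (dates in column A, a helper cell in Z1): =COUNTIF($A$2:$A$800001,TODAY()) filled down the sheet recalculates on every change anywhere in the workbook, because TODAY() is volatile. Entering the date once as a static value (Ctrl+; in Z1) and filling down =COUNTIF($A$2:$A$800001,$Z$1) instead takes the volatility out of the whole column.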
Upvotes: 1
Reputation: 20463
I would create an Is_Duplicated indicator column and use that to filter the duplicated CircleScores:
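A minimal version of such an indicator formula, as a sketch only (assuming the sample layout above sits in columns A:D, with headers in row 1, CircleScore in column D, and roughly 800K data rows):
E1: Is_Duplicated
E2: =COUNTIF($D$2:$D$800001,D2)>1
Fill E2 down, then filter column E for TRUE to keep every row whose CircleScore appears more than once.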
Alternatively, you can sort the CircleScore column and make the formula a bit less taxing on your system (NOTE: CircleScore must be sorted beforehand):
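A sketch of the sorted variant under the same assumed layout: once CircleScore is sorted, duplicates sit in adjacent rows, so each row only needs to compare against its two neighbors:
E2: =OR(D2=D1,D2=D3)
Filled down and filtered for TRUE, this reads two cells per row rather than running an 800K-row COUNTIF per row, which is why it is far less taxing.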
Upvotes: 3