Reputation: 103
I have a large data set that is exported from a website. I use a macro in my main ‘filter’ workbook to find the file and copy the data from Sheet1 of the exported file into Sheet1 of the filter workbook.
Once the data is copied into Sheet1 of the filter workbook, I use VBA to copy columns A/B/D/F/H/Z/AA/etc from Sheet 1 of the filter workbook into Sheet2 of the filter workbook AND also at the same time, I use this code here to attempt to delete any duplicate rows:
Worksheets("Sheet2").Range("A:DZ").RemoveDuplicates Columns:=15, Header:=xlYes
I am finding though that the RemoveDuplicates is not working as expected.
As an example, Sheet1 in the filter workbook (and export workbook) has 3344 rows. When I manually filter using conditional formatting to highlight duplicates, I can find 314 rows listed as duplicates (meaning 157 genuine rows of actual data and 157 rows which are duplicates of that actual data. I haven’t found any examples of duplicates existing more than one time each). So on Sheet2 I was expecting to see 3344 – 157 = 3157 Rows of real data. I don’t see that, or even 3030 rows (3344-314). Instead, I am getting 1897 rows paste into Sheet2, which is a difference of 1447 rows (1290 less rows than expected).
On top of that, I am manually checking the data to see what is up by using Control-F in the column and am finding that in some instances that both of the two duplicated items are missing from Sheet2 (versus it just deleting the one duplicate row).
The duplicate data is not on sequential rows and is scattered throughout the column in Sheet2. But when I sort it before I attempt to DeleteDuplicates, it does not seem to impact its accuracy or make it right. I have also tried using the DeleteDuplicates in different locations of the code / at different times but the results are always off by the same amount (1447 missing rows instead of 157 missing rows).
I found only a few articles on this site, this one was the closest but not quiet my issue: Delete Rows With Duplicate Data VBA
And other help sites/forums mention there was some bug with office 2007 that prevents this from working (am using 2013).
Does anyone know where I can find a work around, or if one exists - or if this is still a real bug or just a problem with the code line I have above.
Adding bits of code I use in this example in case it is something within these that is causing the problem…
Part of the copy code:
wsFromSheet.Cells.Copy
wsToFile.Range("A1").PasteSpecial xlPasteValuesAndNumberFormats
Application.CutCopyMode = False
wbFromFile.Close True
Part of the ‘paste column code’:
Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").Columns.Count).ClearContents 'Clear from row 1 down
LastRowFromSiteTracker = xSht.Cells.SpecialCells(xlCellTypeLastCell).Row 'original report has 128 columns
xSht.Range("B1:B" & LastRowFromSiteTracker).Copy ySht.Cells(Rows.Count, "A").End(xlUp) 'customer name
‘repeat a bunch of times, then…
Application.CutCopyMode = False 'do I need this line?
Worksheets("Sheet2").Range("A:DZ").RemoveDuplicates Columns:=15, Header:=xlYes
End Sub
Example/sample of data:
Row Source Data Expected Data Actual Data
1 1000474608 1000474608 1000474608 (Dup missing from sheet2)
2 1000474608 1000487672 1000487672
3 1000487672 1000487674 1000487674
4 1000487674 1000487676 1000487676 (missing from sheet2, wasn’t a dup)
5 1000487676 1000487678 1000487678
6 1000487678 1000487680 1000487680
7 1000487680 1000487682 1000487682 (Dup missing from sheet2)
8 1000487682 1000520278 1000520278
9 1000487682 1000520280 1000520280
10 1000520278 1000520282 1000520282 (Is there)
11 1000520280 1000520286 1000520286
12 1000520282 1000520336 1000520336 (Is there)
13 1000520282 1000520338 1000520338
14 1000520286 1000520392 1000520392
15 1000520286 1000520394 1000520394
16 1000520336 1000530333 1000530333
17 1000520338
18 1000520392
19 1000520394
20 1000530333
EDIT: EDIT: EDIT: So I've tried to do some more manual testing, and tried two separate things with the same set of data, getting two different results. I used the Conditional Formatting - Highlight Duplicates from the main Home ribbon and the Remove Duplicates from the Data ribbon.
The Remove Duplicates button finds and removed 163 items in Column P and leaves 3181 rows.
However, when I use the Highlight Duplicates conditional format finds 314 items that are duplicated within Column P, leaving 3030 non duplicates.
It does not make sense that these two numbers do not match. I thought it has something to do with the duplicates themselves - as most duplicated items have only one dup of itself (123123 shows up in two rows) but then just a small handful of rows are duplicated multiple times (234234 shows up in 4 or 6 columns).
So instead of using the manual way, I used the suggestions I've found online, and both of these also provide differing results when run:
3344 Base records
1897 left after scrub of duplicates (1446 removed)
Dim tmpAddress As String
tmpAddress = "A2:BZ" & Worksheets("ColScrub").UsedRange.Rows.Count
Worksheets("ColScrub").Range(tmpAddress).RemoveDuplicates Columns:=15, Header:=xlNo
3181 left after scrub of duplicates (162 removed)
Cells.Select
ActiveSheet.Range("$A$1:$EI$3345").RemoveDuplicates Columns:=31, Header:=xlYes
Upvotes: 2
Views: 9925
Reputation: 46
My further experience now shows that UsedRange is completely unreliable if you have blank rows or columns. UsedRange only includes rows/columns up to the blank one. I have found a better way to get the last of each. These function use 2 basic assumptions, which should hold true for mostof your spreadsheets.
With this in mind, I have created the following 2 functions retrieve the last values accurately, every time ... well, almost (my complete function handles issues of the footer rows with merged cells)
For the last row
Public Function Excel_GetLastRow(xlSheet As Excel.Worksheet, _
ByVal KeyColumn As Long) As Long
' This could be adjusted for exact max rows Excel allows
Const MAX_XL_ROWS As Long = 1048000
Excel_GetLastRow = xlSheet.Cells(MAX_XL_ROWS, KeyColumn).End(xlUp).row
End Function
And for last column
Public Function Excel_GetLastCol(xlSheet As Excel.Worksheet, _
ByVal HeaderRow As Long) As Long
' This could be adjusted for exact max columns Excel allows
Const MAX_XL_COLS As Long = 16000
Excel_GetLastCol = xlSheet.Cells(MAX_XL_COLS, HeaderRow).End(xlToLeft).Column
End Function
Using these values you can now set your complete data range successfully.
top left = Cells(HeaderRow + 1, 1)
bottom right = Cells(LastRow, LastCol)
My complete functions include error handling and allowances for possible merged cells in the footer row and last header column, but you get the idea.
Art
Upvotes: 1
Reputation: 103
I don't know why (or if/how) this is any different, but this seems to be the closest I can get to true removal of duplicates. I wanted to add it here as an answer for others in similar situations.
Dim lastrow As Long
With ThisWorkbook.Worksheets("ColScrub")
If Application.WorksheetFunction.CountA(.Cells) <> 0 Then
lastrow = .Cells.Find(What:="*", _
After:=.Range("A1"), _
Lookat:=xlPart, _
LookIn:=xlFormulas, _
SearchOrder:=xlByRows, _
SearchDirection:=xlPrevious, _
MatchCase:=False).Row 'Change P1 back to A1 if needed
Else
lastrow = 1
End If
.Range("A1:AZ" & lastrow).RemoveDuplicates Columns:=Array(16), Header:=xlYes
End With
I have to go through each row visually to prove this works I think, and to rule out that it isn't deleting things that should not be deleted - but this seems to get rid of the 'double duplicates' (4-6 lines items instead of 2 like the other duplicates).
Upvotes: 0
Reputation: 46
1) you are only clearing as many rows as your have columns, not rows Also, you may not be clearing anything, so use the UsedRange.Rows for proper rowcount
This line ...
Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").Columns.Count).ClearContents
Should read ...
Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").UsedRange.Rows.Count).ClearContents
Without properly clearing the old data, unpredictable results may occur.
2) Excel VBA seems rather quirky in that many things won't work correctly without specifically "selecting" the object(s) in question AND specifiying complete (not columnar) ranges
3) I also prefer to leave out the header row (note the "A2") and pass Header:=xlNo
4) Unless you have more than 625 columns, BZ should be far enough
So add this to your code ...
Dim tmpAddress as String
tmpAddress = "A2:BZ" & Worksheets("Sheet2").UsedRange.Rows.Count
Worksheets("Sheet2").Activate
Worksheets("Sheet2").Range(tmpAddress).RemoveDuplicates Columns:=15, Header:=xlNo
Hope this helps :)
Upvotes: 0