Eric King
Eric King

Reputation: 103

RemoveDuplicates is not working as expected

I have a large data set that is exported from a website. I use a macro in my main ‘filter’ workbook to find the file and copy the data from Sheet1 of the exported file into Sheet1 of the filter workbook.

Once the data is copied into Sheet1 of the filter workbook, I use VBA to copy columns A/B/D/F/H/Z/AA/etc from Sheet 1 of the filter workbook into Sheet2 of the filter workbook AND also at the same time, I use this code here to attempt to delete any duplicate rows:

Worksheets("Sheet2").Range("A:DZ").RemoveDuplicates Columns:=15, Header:=xlYes

I am finding though that the RemoveDuplicates is not working as expected.

As an example, Sheet1 in the filter workbook (and export workbook) has 3344 rows. When I manually filter using conditional formatting to highlight duplicates, I can find 314 rows listed as duplicates (meaning 157 genuine rows of actual data and 157 rows which are duplicates of that actual data. I haven’t found any examples of duplicates existing more than one time each). So on Sheet2 I was expecting to see 3344 – 157 = 3157 Rows of real data. I don’t see that, or even 3030 rows (3344-314). Instead, I am getting 1897 rows paste into Sheet2, which is a difference of 1447 rows (1290 less rows than expected).

On top of that, I am manually checking the data to see what is up by using Control-F in the column and am finding that in some instances that both of the two duplicated items are missing from Sheet2 (versus it just deleting the one duplicate row).

The duplicate data is not on sequential rows and is scattered throughout the column in Sheet2. But when I sort it before I attempt to DeleteDuplicates, it does not seem to impact its accuracy or make it right. I have also tried using the DeleteDuplicates in different locations of the code / at different times but the results are always off by the same amount (1447 missing rows instead of 157 missing rows).

I found only a few articles on this site, this one was the closest but not quiet my issue: Delete Rows With Duplicate Data VBA

And other help sites/forums mention there was some bug with office 2007 that prevents this from working (am using 2013).

Does anyone know where I can find a work around, or if one exists - or if this is still a real bug or just a problem with the code line I have above.

Adding bits of code I use in this example in case it is something within     these that is causing the problem…
Part of the copy code: 
wsFromSheet.Cells.Copy
wsToFile.Range("A1").PasteSpecial xlPasteValuesAndNumberFormats
Application.CutCopyMode = False
wbFromFile.Close True

Part of the ‘paste column code’:
Sheets("Sheet2").Rows(1 & ":" &     Sheets("Sheet2").Columns.Count).ClearContents 'Clear from row 1 down
LastRowFromSiteTracker = xSht.Cells.SpecialCells(xlCellTypeLastCell).Row      'original report has 128 columns
xSht.Range("B1:B" & LastRowFromSiteTracker).Copy ySht.Cells(Rows.Count,    "A").End(xlUp)      'customer name
‘repeat a bunch of times, then…
Application.CutCopyMode = False  'do I need this line?
Worksheets("Sheet2").Range("A:DZ").RemoveDuplicates Columns:=15,   Header:=xlYes 
End Sub


Example/sample of data:

Row Source Data Expected Data   Actual Data
1   1000474608  1000474608  1000474608 (Dup missing from sheet2)
2   1000474608  1000487672  1000487672
3   1000487672  1000487674  1000487674
4   1000487674  1000487676  1000487676 (missing from sheet2, wasn’t a dup)
5   1000487676  1000487678  1000487678
6   1000487678  1000487680  1000487680
7   1000487680  1000487682  1000487682 (Dup missing from sheet2)
8   1000487682  1000520278  1000520278
9   1000487682  1000520280  1000520280
10  1000520278  1000520282  1000520282 (Is there)
11  1000520280  1000520286  1000520286
12  1000520282  1000520336  1000520336 (Is there)
13  1000520282  1000520338  1000520338
14  1000520286  1000520392  1000520392
15  1000520286  1000520394  1000520394
16  1000520336  1000530333  1000530333
17  1000520338      
18  1000520392      
19  1000520394      
20  1000530333      

EDIT: EDIT: EDIT: So I've tried to do some more manual testing, and tried two separate things with the same set of data, getting two different results. I used the Conditional Formatting - Highlight Duplicates from the main Home ribbon and the Remove Duplicates from the Data ribbon.

The Remove Duplicates button finds and removed 163 items in Column P and leaves 3181 rows.

However, when I use the Highlight Duplicates conditional format finds 314 items that are duplicated within Column P, leaving 3030 non duplicates.

It does not make sense that these two numbers do not match. I thought it has something to do with the duplicates themselves - as most duplicated items have only one dup of itself (123123 shows up in two rows) but then just a small handful of rows are duplicated multiple times (234234 shows up in 4 or 6 columns).

So instead of using the manual way, I used the suggestions I've found online, and both of these also provide differing results when run:

3344 Base records

1897 left after scrub of duplicates  (1446 removed)

Dim tmpAddress As String
tmpAddress = "A2:BZ" & Worksheets("ColScrub").UsedRange.Rows.Count 
Worksheets("ColScrub").Range(tmpAddress).RemoveDuplicates Columns:=15, Header:=xlNo


3181 left after scrub of duplicates (162 removed)

Cells.Select
ActiveSheet.Range("$A$1:$EI$3345").RemoveDuplicates Columns:=31, Header:=xlYes

Upvotes: 2

Views: 9925

Answers (3)

tiedied61
tiedied61

Reputation: 46

My further experience now shows that UsedRange is completely unreliable if you have blank rows or columns. UsedRange only includes rows/columns up to the blank one. I have found a better way to get the last of each. These function use 2 basic assumptions, which should hold true for mostof your spreadsheets.

  1. For LastRow there is a "key" column, i.e. a column where the MUST be data, for example an ID column
  2. For LastCol there should be a header row (or row where you can guarantee the last column is filled)

With this in mind, I have created the following 2 functions retrieve the last values accurately, every time ... well, almost (my complete function handles issues of the footer rows with merged cells)

For the last row Public Function Excel_GetLastRow(xlSheet As Excel.Worksheet, _ ByVal KeyColumn As Long) As Long ' This could be adjusted for exact max rows Excel allows Const MAX_XL_ROWS As Long = 1048000 Excel_GetLastRow = xlSheet.Cells(MAX_XL_ROWS, KeyColumn).End(xlUp).row End Function

And for last column

Public Function Excel_GetLastCol(xlSheet As Excel.Worksheet, _
                                         ByVal HeaderRow As Long) As Long
' This could be adjusted for exact max columns Excel allows
Const MAX_XL_COLS As Long = 16000
Excel_GetLastCol = xlSheet.Cells(MAX_XL_COLS, HeaderRow).End(xlToLeft).Column
End Function

Using these values you can now set your complete data range successfully.

top left = Cells(HeaderRow + 1, 1)

bottom right = Cells(LastRow, LastCol)

My complete functions include error handling and allowances for possible merged cells in the footer row and last header column, but you get the idea.

Art

Upvotes: 1

Eric King
Eric King

Reputation: 103

I don't know why (or if/how) this is any different, but this seems to be the closest I can get to true removal of duplicates. I wanted to add it here as an answer for others in similar situations.

 Dim lastrow As Long


With ThisWorkbook.Worksheets("ColScrub")
    If Application.WorksheetFunction.CountA(.Cells) <> 0 Then
        lastrow = .Cells.Find(What:="*", _
                      After:=.Range("A1"), _
                      Lookat:=xlPart, _
                      LookIn:=xlFormulas, _
                      SearchOrder:=xlByRows, _
                      SearchDirection:=xlPrevious, _
                      MatchCase:=False).Row  'Change P1 back to A1 if needed
    Else
        lastrow = 1
    End If

    .Range("A1:AZ" & lastrow).RemoveDuplicates Columns:=Array(16), Header:=xlYes
End With

I have to go through each row visually to prove this works I think, and to rule out that it isn't deleting things that should not be deleted - but this seems to get rid of the 'double duplicates' (4-6 lines items instead of 2 like the other duplicates).

Upvotes: 0

tiedied61
tiedied61

Reputation: 46

1) you are only clearing as many rows as your have columns, not rows Also, you may not be clearing anything, so use the UsedRange.Rows for proper rowcount

This line ...

Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").Columns.Count).ClearContents 

Should read ...

Sheets("Sheet2").Rows(1 & ":" & Sheets("Sheet2").UsedRange.Rows.Count).ClearContents 

Without properly clearing the old data, unpredictable results may occur.

2) Excel VBA seems rather quirky in that many things won't work correctly without specifically "selecting" the object(s) in question AND specifiying complete (not columnar) ranges

3) I also prefer to leave out the header row (note the "A2") and pass Header:=xlNo

4) Unless you have more than 625 columns, BZ should be far enough

So add this to your code ...

Dim tmpAddress as String

tmpAddress = "A2:BZ" & Worksheets("Sheet2").UsedRange.Rows.Count

Worksheets("Sheet2").Activate
Worksheets("Sheet2").Range(tmpAddress).RemoveDuplicates Columns:=15, Header:=xlNo 

Hope this helps :)

Upvotes: 0

Related Questions