JNC
JNC

Reputation: 33

How to highlight rows that have duplicate values in 2 or more columns

I am trying to figure out how to best do this. I have a sheet which has about 44 columns and around 64,000 rows. The columns have different customer data points such as name, date of birth, phone number, and e-mail (these are the most relevant columns for my purposes). I was wondering how I could sort by or highlight the rows in which at least three column data points match, to show a duplicate record for a customer. To explain clearly, I only want to highlight the rows that are duplicates based on at least 3 columns (the name column (the constant) and either phone number or DOB or e-mail.)

For example:

enter image description here

In the above, John Smith matched based on DOB alone. Lisa winters based on email, and Stephanie wright based on both DOB and email.

Now that I am looking at it more I will combine first and last name into one column so it will only have to match 2 or more columns instead of three.

I posted in superuser and all I got was countifs which seems like a start, but I seem to need to incorporate " and, or" logic as well?

Any help with specific formulas is greatly appreciated!

Upvotes: 1

Views: 646

Answers (2)

Tom Sharpe
Tom Sharpe

Reputation: 34370

Just for comparison, this would be the array-type approach but as @Luuklag rightly says, it could be slow with 64K rows of data, although it does give complete results

=SUMPRODUCT(($A2<>"")*($A2=$A$2:$A$10)*($B2=$B$2:B$10)*SIGN((($C2=$C$2:$C$10)+($D2=$D$2:$D$10)+($E2=$E$2:$E$10))))>1

So this tests all rows to see if there is more than one which agrees with the current row on last name, first name, and one of DOB, phone and email, assuming your data is in the first five columns and omitting any rows where last name is blank. Adjust ranges to suit.

This is too slow on 64K rows. A little better is to use SUMIFS

=(COUNTIFS($A$2:$A$64000,$A2,$B$2:$B$64000,$B2,$C$2:$C$64000,$C2)
+COUNTIFS($A$2:$A$64000,$A2,$B$2:$B$64000,$B2,$D$2:$D$64000,$D2)
+COUNTIFS($A$2:$A$64000,$A2,$B$2:$B$64000,$B2,$E$2:$E$64000,$E2))>3

Upvotes: 1

Luuklag
Luuklag

Reputation: 3914

You should sort your data on name. And then create an extra helper column that binary indicates wether it is a duplicate or not.

You could simply use a formula in F2 like:

=IF(AND($A2=$A1,$B2=$B1,OR($C2=$C1,$D2=$D1,$E2=$E1)),1,0)

This will give you 1's in column F for those that are a duplicate of the row above based on both first and last name, and at least one other column. This isn't an completely ideal situation ofcourse, as it doesn't always show a duplication. For example:

If there are 3 entries with the same name, and the first has all other fields populated. The second entry has only name and email. And is considered a match to the first entry. The third entry has only name and DOB, and isn't considered a match to the second entry, as only the names match.

To circumvent this you would require the use of INDEX(MATCH()), however that is quiet the burden on your pc, especially if you are going to use it recursively on 64K entries.

Upvotes: 1

Related Questions