Reputation: 1
We have a large excel file with 120000 rows of data where we need to find duplicated records.
The first records is the Master record to match against the list of duplicated records. We need to confirm whether the duplicate are true and highlight them.
Each record has a number of rows depending on the number of email addresses on file and also physical addresses.
Desired results are: 1. If the name in the Master record matches the name in the duplicated record, then find a matching any addresses from the master record with any of the addresses in the duplicated records or matching the email addresses from master record with duplicated email addresses in the duplicated records. Then put a YES in the duplicates column for either matches.
There may be more than one duplicated record found and this is shown in column A
If the names do not match, but everything else matches, we still want to put a NO so that we can do a manual check when going through the excel document.
This is the formula that is being used in column E. It only works if the first address in the master record has a match in the duplicated records. If it is the 2nd address in the Master record that matches, then it doesn't work.
=IF(OR(C2&B2=B2,C2&B2=C2,D2&C2=D2, D2&C2=C2,B2="Addresses",C2="Name"),"",IF(SUM(--(C2&B2&D2=C:C&B:B&D:D))>1,"YES","NO"))
Can you help with fixing the formula and also assisting with highlighting the duplicates?
A B C D E
Master Addresses Name Email address Duplicate
255812 1 test rd Sydney John Doe [email protected] YES
1 test rd Sydney [email protected]
PO BOX 2 Sydney
Addresses Name Email address
421504 1 test rd Sydney John Doe [email protected] YES
Duplicate
records
found: 1
Master Addresses Name Email address
455444 1111 House rd Sydney Steven Jacks [email protected] NO
134 Smart St Sydney
Addresses Name Email address
53421 134 Smart St Sydney Steven Jacks [email protected] NO
1111 House rd Sydney [email protected]
Level 5, Kings St Syd
134 Smart St Sydney
Addresses Name Email address
13232 Level 5, Kings St Syd Steven R Jacks [email protected] NO
Duplicate
records
found: 2
Upvotes: 0
Views: 2073
Reputation: 8941
Elaborating on my comment to your question, here's a way to kind of "normalize your file a bit before proceeding with finding duplicates:
Save the original sort order of the file (Column F - no formula)
Assuming that no Master has more than 999 attached addresses, calculate a key for all real data records and save as value
G2 and down: =IF(ISNUMBER(A2);A2;IF(AND(A2="";AND(B2<>"";B2<>"Addresses");ISNUMBER(G1));G1+0,001;""))
H2 and down: =G2
, then copy/paste as value
now sort according to your duplication criteria (by name, address)
then add a dup finder formula like "if this row key = prev row key then this counter = prev counter + 1 else this counter = 1", e.g.
I2 and down: =IF(ISNUMBER(H2);IF(AND(C2<>"";C2=C1;B2=B1);I1+1;1);"")
now you can
etc ...
and we're not taking into account different spellings like abbreviations & capitals, add'l white spaces or interpunctations (Rd vs Rd. vs road, Road, 1 vs 1. vs 1.,) - middle initials in names, etc. etc., all of which are the real pain for cleaning address files
you can repeat the above for a different sort order (e.g. address + name)
Remember before any sorting you should convert formulas into values!
now ... this all may sound rather complicated ... therefore ... if this is not a one-off but a recurring exercise, I'd think about VBA ...
Upvotes: 0