Patrick
Patrick

Reputation: 1

Match duplicated rows in excel within a range

We have a large excel file with 120000 rows of data where we need to find duplicated records.

The first records is the Master record to match against the list of duplicated records. We need to confirm whether the duplicate are true and highlight them.

Each record has a number of rows depending on the number of email addresses on file and also physical addresses.

Desired results are: 1. If the name in the Master record matches the name in the duplicated record, then find a matching any addresses from the master record with any of the addresses in the duplicated records or matching the email addresses from master record with duplicated email addresses in the duplicated records. Then put a YES in the duplicates column for either matches.

  1. Highlight the matching addresses in a colour, highlight the matching names with another colour, highlight the matching email addresses with another colour.

There may be more than one duplicated record found and this is shown in column A

If the names do not match, but everything else matches, we still want to put a NO so that we can do a manual check when going through the excel document.

This is the formula that is being used in column E. It only works if the first address in the master record has a match in the duplicated records. If it is the 2nd address in the Master record that matches, then it doesn't work.

=IF(OR(C2&B2=B2,C2&B2=C2,D2&C2=D2, D2&C2=C2,B2="Addresses",C2="Name"),"",IF(SUM(--(C2&B2&D2=C:C&B:B&D:D))>1,"YES","NO"))

Can you help with fixing the formula and also assisting with highlighting the duplicates?

A            B                    C                       D                    E                            
Master   Addresses            Name                   Email address        Duplicate
255812   1 test rd Sydney     John Doe               [email protected]   YES
         1 test rd Sydney                            [email protected] 
         PO BOX 2 Sydney            

         Addresses            Name                   Email address  
421504   1 test rd Sydney     John Doe               [email protected]   YES
Duplicate
records
found: 1                

Master  Addresses             Name                    Email address 
455444  1111 House rd Sydney  Steven Jacks            [email protected]     NO
        134 Smart St Sydney         

        Addresses             Name                    Email address 
53421   134 Smart St Sydney   Steven Jacks            [email protected]     NO
        1111 House rd Sydney                          [email protected]  
        Level 5, Kings St Syd                       
        134 Smart St Sydney         

        Addresses              Name                   Email address 
13232   Level 5, Kings St Syd  Steven R Jacks         [email protected]  NO

Duplicate
records 
found: 2

Upvotes: 0

Views: 2073

Answers (1)

MikeD
MikeD

Reputation: 8941

Elaborating on my comment to your question, here's a way to kind of "normalize your file a bit before proceeding with finding duplicates:

Save the original sort order of the file (Column F - no formula)

Assuming that no Master has more than 999 attached addresses, calculate a key for all real data records and save as value

G2 and down: =IF(ISNUMBER(A2);A2;IF(AND(A2="";AND(B2<>"";B2<>"Addresses");ISNUMBER(G1));G1+0,001;""))

H2 and down: =G2, then copy/paste as value

enter image description here

now sort according to your duplication criteria (by name, address)

then add a dup finder formula like "if this row key = prev row key then this counter = prev counter + 1 else this counter = 1", e.g.

I2 and down: =IF(ISNUMBER(H2);IF(AND(C2<>"";C2=C1;B2=B1);I1+1;1);"")

enter image description here

now you can

  • filter column I for all values > 1 to identify duplicates. you'll find only John Doe at 1 test rd Sydney ...
  • feed back the result via VLOOKUP on the KeyValue column to your original table
  • etc ...

    and we're not taking into account different spellings like abbreviations & capitals, add'l white spaces or interpunctations (Rd vs Rd. vs road, Road, 1 vs 1. vs 1.,) - middle initials in names, etc. etc., all of which are the real pain for cleaning address files

you can repeat the above for a different sort order (e.g. address + name)

Remember before any sorting you should convert formulas into values!

now ... this all may sound rather complicated ... therefore ... if this is not a one-off but a recurring exercise, I'd think about VBA ...

Upvotes: 0

Related Questions