goldisfine

Reputation: 4850

Removing duplicate rows

[Similar Post][1]

I have a tab-delimited spreadsheet and I'm trying to figure out a way to remove duplicate entries. Here's some made-up data that has the same form as the data in the spreadsheet:

name    phone   email   website 
Diane Grant Albrecht M.S.           
"Lannister G. Cersei M.A.T., CEP"   111-222-3333    [email protected]  www.got.com
Argle D. Bargle Ed.M.           
Sam D. Man Ed.M.    000-000-1111    [email protected]   www.daManWithThePlan.com
Sam D. Man Ed.M.    
Sam D. Man Ed.M.    111-222-333     [email protected]   www.daManWithThePlan.com
D G Bamf M.S.           
Amy Tramy Lamy Ph.D.    

I would like to have the duplicate rows for Sam D. Man merged into one that keeps the two phone numbers but doesn't store two of the same email and two of the same website.

The way I thought about doing this was to store the previous row and compare the names. If the names match, then compare the phone numbers. If the phone numbers don't match, append to the first row. Then compare the emails. If the emails don't match, append to the first row. And then compare the websites. If the websites don't match, then append the second website to the first. Then delete the second row.

  1. I don't know how to delete a row. The other posts seem to avoid actually deleting rows by writing rows to a new file. But I think this is problematic for my case, because I do not want to write the rows with the same names twice.
  2. Is there a more efficient means to loop through? Nested for loops are taking a while.
    1. And I can see myself running into issues with indexing over the limit...

Here's my code:

import csv

with open('ieca_first_col_fake_text.txt', 'rU') as f:
    sheet = csv.DictReader(f, delimiter='\t')

# This function takes a tab-delim csv and merges the ones with the same name but different phone / email / websites.
def merge_duplicates(sheet):

    # Since duplicates immediately follow, store adjacent and compare. If the same name, append phone number 
    for row in sheet:
        for other_row in sheet:
            if row['name'] == other_row['name']:
                if row['email'] != other_row['email']:
                    row['email'].append(other_row['email'])
                if row['website'] != other_row['website']:
                    row['website'].append(other_row['website'])

    # code to remove duplicate row
    # delete.() or something...

merge_duplicates(sheet)

Upvotes: 0

Views: 1121

Answers (1)

erewok

Reputation: 7835

In this case, depending on how large your 'sheet' is, it might be useful to turn your csv.DictReader object into a list, so that you can slice it and compare the various fields that way. I think your logic is correct too when you say the following:

The way I thought about doing this was to store the previous row and compare the names. 1) If the names match, then 2) compare the phone numbers. If the phone numbers don't match, 3) append to the first row. 4) Then compare the emails. 5) If the emails don't match, append to the first row. 6) And then compare the websites. 7) If the websites don't match, then append the second website to the first. Then delete the second row. (not necessary, just skip it)

Here's my (quickly written before work) recommendation:

import csv

def merge_duplicates(sheet):
    mysheet = list(sheet)

    for rowvalue, row in enumerate(mysheet):
        for other_row in mysheet[rowvalue + 1:]:  # slicing never runs past the end
            if row['name'] == other_row['name']:  # check if it's a duplicate name
                other_row['delete'] = "duplicate row"  # add delete key for later sorting
                if row['email'] != other_row['email']:
                    row['alt_email'] = other_row['email']  # add new "alt_email" key to original row
                # test other fields here...
    return mysheet

with open('ieca_first_col_fake_text.txt', 'rU') as f:
    mysheet = merge_duplicates(csv.DictReader(f, delimiter='\t'))

After that, you'd iterate through the list one more time and keep only the rows without a "delete" key.
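That final pass can be a one-line list comprehension. A sketch, with made-up rows standing in for the merged list:

```python
# Rows as they might look after the merge pass above (made-up sample data)
mysheet = [
    {'name': 'Sam D. Man Ed.M.', 'phone': '000-000-1111'},
    {'name': 'Sam D. Man Ed.M.', 'phone': '111-222-333', 'delete': 'duplicate row'},
    {'name': 'Amy Tramy Lamy Ph.D.', 'phone': ''},
]

# Keep only the rows that were never flagged with a "delete" key
cleaned = [row for row in mysheet if 'delete' not in row]
```

`'delete' not in row` tests for the key's presence, so rows flagged during the merge are dropped without ever mutating the list in place.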

Upvotes: 1
