Reputation: 4850
[Similar Post][1]
I have a tab-delimited spreadsheet and I'm trying to figure out a way to remove duplicate entries. Here's some made-up data that has the same form as the data in the spreadsheet:
name phone email website
Diane Grant Albrecht M.S.
"Lannister G. Cersei M.A.T., CEP" 111-222-3333 [email protected] www.got.com
Argle D. Bargle Ed.M.
Sam D. Man Ed.M. 000-000-1111 [email protected] www.daManWithThePlan.com
Sam D. Man Ed.M.
Sam D. Man Ed.M. 111-222-333 [email protected] www.daManWithThePlan.com
D G Bamf M.S.
Amy Tramy Lamy Ph.D.
I would like the duplicate rows for Sam D. Man merged into a single row that keeps both phone numbers but doesn't store the identical email and website twice.
The way I thought about doing this was to store the previous row and compare the names. If the names match, then compare the phone numbers. If the phone numbers don't match, append the second phone number to the first row. Then compare the emails. If the emails don't match, append the second email to the first row. Then compare the websites. If the websites don't match, append the second website to the first. Finally, delete the second row.
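The merge described above could be sketched like this (a hypothetical, self-contained version: the field names come from the header row in the question, while the sample data, the helper name `merge_rows`, and the `"; "` separator for accumulated values are my own assumptions):

```python
import csv
import io

def merge_rows(rows):
    """Merge consecutive rows that share a name, accumulating phone
    numbers and skipping duplicate emails/websites."""
    merged = []
    for row in rows:
        if merged and merged[-1]['name'] == row['name']:
            prev = merged[-1]
            for field in ('phone', 'email', 'website'):
                # Only append a value that isn't already stored.
                if row[field] and row[field] not in prev[field].split('; '):
                    prev[field] = '; '.join(filter(None, [prev[field], row[field]]))
        else:
            merged.append(dict(row))
    return merged

# Made-up sample data in the question's tab-delimited shape.
data = (
    "name\tphone\temail\twebsite\n"
    "Sam D. Man Ed.M.\t000-000-1111\tsam@example.com\twww.daManWithThePlan.com\n"
    "Sam D. Man Ed.M.\t111-222-333\tsam@example.com\twww.daManWithThePlan.com\n"
)
result = merge_rows(csv.DictReader(io.StringIO(data), delimiter='\t'))
```

This relies on duplicates sitting on adjacent lines, which matches the spreadsheet shown above; if duplicates could appear anywhere, you'd group by name first instead of comparing neighbors.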
Here's my code:
import csv

with open('ieca_first_col_fake_text.txt', 'rU') as f:
    sheet = csv.DictReader(f, delimiter='\t')

    # This function takes a tab-delimited csv and merges the rows with the
    # same name but different phone / email / websites.
    def merge_duplicates(sheet):
        # Since duplicates immediately follow, store adjacent rows and
        # compare. If the names are the same, append the phone number.
        for row in sheet:
            for other_row in sheet:
                if row['name'] == other_row['name']:
                    if row['email'] != other_row['email']:
                        row['email'].append(other_row['email'])
                    if row['website'] != other_row['website']:
                        row['website'].append(other_row['website'])
                    # code to remove duplicate row
                    # delete.() or something...

    merge_duplicates(sheet)
Upvotes: 0
Views: 1121
Reputation: 7835
In this case, depending on how large your 'sheet' is, it might be useful to turn your csv.DictReader object into a list so that you can slice it and compare the various fields that way. I think your logic is correct, too, when you say the following:
The way I thought about doing this was to store the previous row and compare the names. 1) If the names match, then 2) compare the phone numbers. If the phone numbers don't match, 3) append to the first row. 4) Then compare the emails. 5) If the emails don't match, append to the first row. 6) And then compare the websites. 7) If the websites don't match, then append the second website to the first. Then delete the second row. (not necessary, just skip it)
Here's my (quickly written before work) recommendation:
import csv

with open('ieca_first_col_fake_text.txt', 'rU') as f:
    sheet = csv.DictReader(f, delimiter='\t')

    def merge_duplicates(sheet):
        mysheet = list(sheet)
        for rowvalue, row in enumerate(mysheet):
            try:
                for other_row in mysheet[rowvalue + 1:]:
                    if row['name'] == other_row['name']:  # check if it's a duplicate name
                        other_row['delete'] = "duplicate row"  # add a "delete" key for later filtering
                        if row['email'] != other_row['email']:
                            row['alt_email'] = other_row['email']  # add a new "alt_email" key to the original row
                        # test other fields here...
            except IndexError:
                print("We're at the end now")
After that, you'd need to iterate through and ignore every row with a "delete" key in it and keep only those without.
Upvotes: 1