Reputation: 4389
I have a CSV file with multiple entries. Example csv:
user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
ed, 123, [email protected]
I'm trying to remove the duplicates by a specific column in the CSV however with the code below I'm getting an "list index out of range". I thought by comparing row[1]
with newrows[1]
I would find all duplicates and only rewrite the unique entries in file2.csv
. This doesn't work though and I can't understand why.
f1 = csv.reader(open('file1.csv', 'rb'))
newrows = []
for row in f1:
if row[1] not in newrows[1]:
newrows.append(row)
writer = csv.writer(open("file2.csv", "wb"))
writer.writerows(newrows)
My end result is to have a list that maintains the sequence of the file (set
won't work...right?) which should look like this:
user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
Upvotes: 5
Views: 9309
Reputation: 421
I know this is long gone, but in case someone else lands here, please find a better answer here.
Upvotes: 0
Reputation: 71
This solution worked for me but since I'm new to Python, can someone explain the last bit
writer.writerow(row)
phone_numbers.add( row[1] )
I understand that writer looks at the file "file2.csv" and writes each row. What phone_numbers.add( row[1] ) does to eliminate duplicates?
Upvotes: 0
Reputation: 45039
row[1]
refers to the second column in the current row (phone). That's all well in good.
However, you newrows.append(row)
add the entire row to the list.
When you check row[1] in newrows
you are checking the individual phone number against a list of complete rows. But that's not what you want to do. You need to check against a list or set of just phone numbers. For that, you probably want to keep track of the rows and a set of the observed phone numbers.
Something like:
f1 = csv.reader(open('file1.csv', 'rb'))
writer = csv.writer(open("file2.csv", "wb"))
phone_numbers = set()
for row in f1:
if row[1] not in phone_numbers:
writer.writerow(row)
phone_numbers.add( row[1] )
Upvotes: 9