serk
serk

Reputation: 4389

Python: Removing duplicate CSV entries

I have a CSV file with multiple entries. Example csv:

user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
ed, 123, [email protected]

I'm trying to remove the duplicates by a specific column in the CSV however with the code below I'm getting an "list index out of range". I thought by comparing row[1] with newrows[1] I would find all duplicates and only rewrite the unique entries in file2.csv. This doesn't work though and I can't understand why.

f1 = csv.reader(open('file1.csv', 'rb'))
    newrows = []
    for row in f1:
        if row[1] not in newrows[1]:
            newrows.append(row)
    writer = csv.writer(open("file2.csv", "wb"))
    writer.writerows(newrows)

My end result is to have a list that maintains the sequence of the file (set won't work...right?) which should look like this:

user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]

Upvotes: 5

Views: 9309

Answers (3)

Felix Ongati
Felix Ongati

Reputation: 421

I know this is long gone, but in case someone else lands here, please find a better answer here.

Upvotes: 0

Python_Rookie
Python_Rookie

Reputation: 71

This solution worked for me but since I'm new to Python, can someone explain the last bit

writer.writerow(row)
phone_numbers.add( row[1] )

I understand that writer looks at the file "file2.csv" and writes each row. What phone_numbers.add( row[1] ) does to eliminate duplicates?

Upvotes: 0

Winston Ewert
Winston Ewert

Reputation: 45039

row[1] refers to the second column in the current row (phone). That's all well in good.

However, you newrows.append(row) add the entire row to the list.

When you check row[1] in newrows you are checking the individual phone number against a list of complete rows. But that's not what you want to do. You need to check against a list or set of just phone numbers. For that, you probably want to keep track of the rows and a set of the observed phone numbers.

Something like:

f1 = csv.reader(open('file1.csv', 'rb'))
writer = csv.writer(open("file2.csv", "wb"))
phone_numbers = set()
for row in f1:
    if row[1] not in phone_numbers:
        writer.writerow(row)
        phone_numbers.add( row[1] )

Upvotes: 9

Related Questions