Reputation: 2421
I have a csv file like this :
column1 column2
john kerry
adam stephenson
ashley hudson
john kerry
etc..
I want to remove duplicates from this file, to get only :
column1 column2
john kerry
adam stephenson
ashley hudson
I wrote this script that removes duplicates based on lastnames, but I need to remove duplicates based on lastnames AND firstname.
import csv
reader=csv.reader(open('myfilewithduplicates.csv', 'r'), delimiter=',')
writer=csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')
lastnames = set()
for row in reader:
if row[1] not in lastnames:
writer.writerow(row)
lastnames.add( row[1] )
Upvotes: 12
Views: 18507
Reputation: 2077
You can now use the .drop_duplicates method in pandas. I would do the following:
import pandas as pd
toclean = pd.read_csv('myfilewithduplicates.csv')
deduped = toclean.drop_duplicates([col1,col2])
deduped.to_csv('myfilewithoutduplicates.csv')
Upvotes: 13
Reputation: 37249
A quick way would be to create a unique set of rows using the following technique (adopted from @CedricJulien from this post). You lose the DictWriter
benefit of having the column names stored in each row, but it should work for you case:
>>> import csv
>>> with open('testcsv1.csv', 'r') as f:
... reader = csv.reader(f)
... uniq = [list(tup) for tup in set([tuple(row) for row in reader])]
...
>>> with open('nodupes.csv', 'w') as f:
... writer=csv.writer(f)
... for row in uniq:
... writer.writerow(row)
This uses the same technique used by @CedricJulien, which is a nice one-liner to remove the duplicate rows (defined as the same first and last name). This uses the DictReader
/DictWriter
classes:
>>> import csv
>>> with open('testcsv1.csv', 'r') as f:
... reader = csv.DictReader(f)
... rows = [row for row in reader]
...
>>> uniq = [dict(tup) for tup in set(tuple(person.items()) for person in rows)]
>>> with open('nodupes.csv', 'w') as f:
... headers = ['column1', 'column2']
... writer = csv.DictWriter(f, fieldnames=headers)
... writer.writerow(dict((h, h) for h in headers))
... for row in uniq:
... writer.writerow(row)
...
Upvotes: 1
Reputation: 2991
You're really close. Use those columns as the set entry
entries = set()
for row in reader:
key = (row[0], row[1]) # instead of just the last name
if key not in entries:
writer.writerow(row)
entries.add(key)
Upvotes: 20