Reputation: 173
I have a comma-separated CSV file of Cities, States, Zip and State Abbreviation. The file has duplicate cities in it with different zip codes. What I need in my output is each city listed only once, with just one zip code, not the same city repeated with all of its different zip codes.
Here is what I have in the data (there are 26k lines in the CSV file):
Phoenix, Arizona, 87654, AZ
Phoenix, Arizona, 87655, AZ
Tuscon, Arizona, 98754, AZ
Tuscon, Arizona, 98755, AZ
Tuscon, Arizona, 98756, AZ
What I need is to remove the duplicate city rows and keep just one, so my output needs to look like this:
Phoenix, Arizona, 87654, AZ
Tuscon, Arizona, 98756, AZ
Here is the code I have at the moment, which is not working.
import os, sys, csv

f1 = csv.reader(open('originalcities.csv', 'rb'))
writer = csv.writer(open("output_cities.csv", "wb"))
city = set()

for row in f1:
    if row[1] not in city:
        write.writerow(row)
        city.add( row[1] )
I'm not sure if this code is even the right approach for the output I am looking for. When I run it I get this error:
    for row in f1:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
Any help is greatly appreciated. Thank you.
Upvotes: 0
Views: 3530
Reputation: 7895
What about using a set to remove the duplicates?
line_sets = []
full_set = []

with open('file.txt') as inp:
    lines = inp.readlines()
    for i in range(0, len(lines)):
        # strip for precaution
        tokens = [w.strip() for w in lines[i].split(',')[0:2]]
        tmp_set = set(tokens)
        if tmp_set not in line_sets:
            full_set.append(lines[i].split(','))
            line_sets.append(tmp_set)

with open('output.txt', 'w') as out:
    for line in full_set:
        out.write(','.join(line))
Phoenix,Arizona,87654,AZ
Tuscon,Arizona,98754,AZ
Upvotes: 0
Reputation: 56634
Using row[1] means you are keeping one entry per STATE. You should be testing on (row[0], row[1]) instead (the city/state combination, to avoid confusing e.g. Richmond, TX with Richmond, VA).
import csv

IN_FILE = 'originalcities.csv'
OUT_FILE = 'output_cities.csv'

def main():
    # open in text mode with newline='', as the Python 3 csv module expects
    with open(IN_FILE, newline='') as inf, open(OUT_FILE, 'w', newline='') as outf:
        incsv, outcsv = csv.reader(inf), csv.writer(outf)
        # read data and de-duplicate by city and state
        citystates = set()
        for row in incsv:
            citystate = tuple(row[0:2])
            if citystate not in citystates:
                outcsv.writerow(row)
                citystates.add(citystate)

if __name__=="__main__":
    main()
**EDIT:**
Based on @Fernando's suggestion, here is a version that does not need a set; it just compares consecutive lines (this requires that the input data already be in sorted order, or at least that all occurrences of each city/state combination appear together).
import csv
from itertools import groupby

IN_FILE = 'originalcities.csv'
OUT_FILE = 'output_cities.csv'

def main():
    with open(IN_FILE, newline='') as inf, open(OUT_FILE, 'w', newline='') as outf:
        incsv = csv.reader(inf)
        outcsv = csv.writer(outf)
        # Read data and de-duplicate by city and state,
        # keeping the first row of each run of consecutive duplicates.
        # ! assumes data is already in sorted order !
        for citystate, rows in groupby(incsv, key=lambda row: row[0:2]):
            outcsv.writerow(next(rows))

if __name__=="__main__":
    main()
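If the input file is not already grouped by city and state, one option is to sort the rows in memory before grouping. This is only a minimal sketch, assuming a 26k-line file fits comfortably in memory:

import csv
from itertools import groupby

with open('originalcities.csv', newline='') as inf, open('output_cities.csv', 'w', newline='') as outf:
    # sort by (city, state) so duplicates become consecutive
    rows = sorted(csv.reader(inf), key=lambda row: row[0:2])
    outcsv = csv.writer(outf)
    for citystate, group in groupby(rows, key=lambda row: row[0:2]):
        # keep only the first row of each city/state group
        outcsv.writerow(next(group))

Note that this writes the cities in alphabetical order rather than in the original file order.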
Upvotes: 1