Matt
Matt

Reputation: 173

Python - How to remove duplicate cells/rows from CSV based on cell

I have a comma separated CSV file of Cities, States, Zip and State Abbreviation. The file has duplicate cites in it with different zip codes. What I need for my output is only 1 city listed and just one zip code, not the same name of a city with all the different zip codes.

Here is what I have in the data (There are 26k lines in the csv file)

Phoenix, Arizona, 87654, AZ
Phoenix, Arizona, 87655, AZ
Tuscon, Arizona, 98754, AZ
Tuscon, Arizona, 98755, AZ
Tuscon, Arizona, 98756, AZ

What I need is to remove the duplicate city rows and just keep one. So my output needs to look like this.

Phoenix, Arizona, 87654, AZ
Tuscon, Arizona, 98756, AZ

Here is the code I have at the moment, that is not working.

import os, sys, csv

f1 = csv.reader(open('originalcities.csv', 'rb'))
writer = csv.writer(open("output_cities.csv", "wb"))
city = set()
for row in f1:
    if row[1] not in city:
        write.writerow(row)
        city.add( row[1] )

I'm not sure if the code is correct for me to get the output I am looking for. When I run this code I get this error.

for row in f1:
_csv.Error: iterator should return strings, not bytes (did you open the file in
text mode?)

Any help is greatly appreciated. Thank you.

Upvotes: 0

Views: 3530

Answers (2)

Fernando
Fernando

Reputation: 7895

What about using set to remove the duplicates?

line_sets = []
full_set = []

with open('file.txt') as inp:
    lines = inp.readlines()

    for i in range(0, len(lines)):
        # strip for precaution
        tokens = [w.strip() for w in lines[i].split(',')[0:2]]
        tmp_set = set(tokens)

        if tmp_set not in line_sets:
            full_set.append(lines[i].split(','))
            line_sets.append(tmp_set)

with open('output.txt', 'w') as out:
    for line in full_set:
        out.write(','.join(line))

output.txt

Phoenix,Arizona,87654,AZ
Tuscon,Arizona,98754,AZ

Upvotes: 0

Hugh Bothwell
Hugh Bothwell

Reputation: 56634

Using row[1] means you are keeping one entry per STATE.

You should be testing on (row[0], row[1]) instead (city/state combination, to avoid confusing e.g. Richmond TX with Richmond VA).

import csv

IN_FILE  = 'originalcities.csv'
OUT_FILE = 'output_cities.csv'

def main():
    with open(IN_FILE, 'rb') as inf, open(OUT_FILE, 'wb') as outf:
        incsv, outcsv  = csv.reader(inf), csv.writer(outf)

        # read data and de-duplicate by city and state
        citystates = set()
        for row in incsv:
            citystate = tuple(row[0:2])
            if citystate not in citystates:
                outcsv.writeline(row)
                citystates.add(citystate)

if __name__=="__main__":
    main()

** EDIT: **

Based on @Fernando's suggestion, here is a version that does not need a set, it just compares consecutive lines (this requires that the input data already be in sorted order, or at least that all occurrences of each city/state combination be together).

import csv
from itertools import groupby

IN_FILE   = 'originalcities.csv'
OUT_FILE  = 'output_cities.csv'

def main():
    with open(IN_FILE, 'rb') as inf, open(OUT_FILE, 'wb') as outf:
        incsv  = csv.reader(inf)
        outcsv = csv.writer(outf)
        # Read data and de-duplicate by city and state
        # ! assumes data is already in sorted order !
        for citystate,rows in groupby(incsv, key=lambda row: row[0:2]):
            outcsv.writerow(next(rows))

if __name__=="__main__":
    main()

Upvotes: 1

Related Questions