Reputation: 23
How can I remove duplicate rows from a csv file based on two columns: grouping by the first field (IPAddress), with the second column determining a match via a regular expression? Finally, I want to add a count field to each remaining row recording how many duplicates it had.
csv file:
IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50
I want to match on IPAddress and Value1 (where two Value1 fields match if their first 5 characters match).
This would give me:
IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
**127.0.0.1, Test1ABA, 30, 40** (Line would be removed but counted)
127.0.0.1, Value1BBA, 40, 50, 2
**127.0.0.1, Value1BBA, 40, 50** (Line would be removed but counted)
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
**127.0.0.2, Value1BBA, 40, 50** (Line would be removed but counted)
New Output:
IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
127.0.0.1, Value1BBA, 40, 50, 2
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
I have tried using a set, but obviously I can't index a set.
import csv
import re

entries = set()
writer = csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')
for row in list:
    key = (row[0], row[1])
    if re.match(r"(Test1)", key[1]) not in entries:
        entries.add(key)
Pseudocode:
# I want to iterate through rows of a csv file and
if row[0] and row[1][:5] match a previous entry:
    remove row
    add count
else:
    add row
Any help or guidance is really appreciated.
Upvotes: 2
Views: 1877
Reputation: 31070
You could use numpy:
import numpy as np
# read the file (assume it is called a.csv) into a structured array:
a = np.genfromtxt('a.csv', delimiter=',', skip_header=1, dtype=None)
# build a comparison key per row: the first column plus the start of the second.
# genfromtxt keeps the space after each comma, so the [:6] slice covers that
# space plus the first 5 characters of the value
p = [x + y[:6] for x, y in zip(a['f0'], a['f1'])]
# compare the keys: np.unique returns the sorted unique keys (k) and the
# index of the first occurrence of each (m)
k, m = np.unique(p, return_index=True)
# use the indexes to build a new array without the duplicates
newlist = [a[v] for v in m]
# the total number of removed rows is the difference in lengths
count = len(a) - len(newlist)
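With numpy 1.9 or newer, np.unique can also report how many rows collapsed into each unique key via return_counts=True, which gives the per-row Count the question asks for instead of a single total. A sketch building on the arrays above (the output filename is just an example; note that np.unique sorts, so rows come out ordered by key):
import csv

# numpy >= 1.9: c[i] is the number of rows sharing the i-th unique key
k, m, c = np.unique(p, return_index=True, return_counts=True)
with open('deduped.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['IPAddress', 'Value1', 'Value2', 'Value3', 'Count'])
    # pair each first-occurrence row with its duplicate count
    for v, n in zip(m, c):
        writer.writerow(list(a[v]) + [n])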
Upvotes: 0
Reputation: 1123590
You need a dictionary to track the matches. You do not need a regular expression: only the first 5 characters need to be tracked. Store each row under a key made up of the first column and the first 5 characters of the second, and keep a count alongside it. Count first, then write out the collected rows and counts.
If ordering matters, replace the dictionary with collections.OrderedDict(); otherwise the code is the same:
rows = {}
with open(inputfilename, 'rb') as inputfile:
    # skipinitialspace swallows the spaces after the commas in the sample data,
    # so the [:5] slice sees the value itself rather than a leading space
    reader = csv.reader(inputfile, skipinitialspace=True)
    headers = next(reader)  # collect the first row as headers for the output
    for row in reader:
        key = (row[0], row[1][:5])
        if key not in rows:
            rows[key] = row + [0]
        rows[key][-1] += 1  # count every occurrence, including the first

with open('myfilewithoutduplicates.csv', 'wb') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(headers + ['Count'])
    writer.writerows(rows.itervalues())
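For reference, a minimal sketch of the OrderedDict variant mentioned above: the loop is the same, with dict.setdefault folding the membership test into the counting line (Python 2, as in the answer; inputfilename is the same placeholder):
import csv
from collections import OrderedDict

rows = OrderedDict()  # remembers the order in which keys were first seen
with open(inputfilename, 'rb') as inputfile:
    reader = csv.reader(inputfile, skipinitialspace=True)
    headers = next(reader)
    for row in reader:
        key = (row[0], row[1][:5])
        # create the row with a zero count on first sight, then count every occurrence
        rows.setdefault(key, row + [0])[-1] += 1

with open('myfilewithoutduplicates.csv', 'wb') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(headers + ['Count'])
    writer.writerows(rows.itervalues())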
Upvotes: 1