Reputation: 23
How can I remove duplicate rows from a csv file based on two columns: grouping by the first field (IPAddress), with the second column determining a match via a regular expression? Finally, I want to add a count field to each remaining row recording how many duplicates it had.
csv file:
IPAddress, Value1, Value2, Value3
127.0.0.1, Test1ABC, 10, 20
127.0.0.1, Test2ABC, 20, 30
127.0.0.1, Test1ABA, 30, 40
127.0.0.1, Value1BBA, 40, 50
127.0.0.1, Value1BBA, 40, 50
127.0.0.2, Test1ABC, 10, 20
127.0.0.2, Value1AAB, 20, 30
127.0.0.2, Value2ABA, 30, 40
127.0.0.2, Value1BBA, 40, 50
I want to match on IPAddress and Value1 (where two Value1 fields match if their first 5 characters match).
This would give me:
IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
**127.0.0.1, Test1ABA, 30, 40** (Line would be removed but counted)
127.0.0.1, Value1BBA, 40, 50, 2
**127.0.0.1, Value1BBA, 40, 50** (Line would be removed but counted)
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
**127.0.0.2, Value1BBA, 40, 50** (Line would be removed but counted)
New Output:
IPAddress, Value1, Value2, Value3, Count
127.0.0.1, Test1ABC, 10, 20, 2
127.0.0.1, Test2ABC, 20, 30, 1
127.0.0.1, Value1BBA, 40, 50, 2
127.0.0.2, Test1ABC, 10, 20, 1
127.0.0.2, Value1AAB, 20, 30, 2
127.0.0.2, Value2ABA, 30, 40, 1
I have tried using a set, but obviously I can't index a set.
import csv
import re

entries = set()
writer = csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')
for row in list:
    key = (row[0], row[1])
    if re.match(r"(Test1)", key[1]) not in entries:
        entries.add(key)
Pseudocode:
# I want to iterate through rows of a csv file and
if row[0] and row[1][:5] match a previous entry:
    remove row
    add count
else:
    add row
Any help or guidance is really appreciated.
Upvotes: 2
Views: 1877
Reputation: 31070
You could use numpy:
import numpy as np
# read the file (assume it is called a.csv) into a structured array:
a = np.genfromtxt('a.csv', delimiter=',', skip_header=1, dtype=None)
# build a comparison key per row: the first column plus the start of the second.
# genfromtxt keeps the space after each comma, so the [:6] slice covers that
# space plus the first 5 characters of the value
p = [x + y[:6] for x, y in zip(a['f0'], a['f1'])]
# compare the keys: np.unique returns the sorted unique keys (k) and the
# index of the first occurrence of each (m)
k, m = np.unique(p, return_index=True)
# use the indexes to build a new array without the duplicates
newlist = [a[v] for v in m]
# the total number of removed rows is the difference in lengths
count = len(a) - len(newlist)
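With numpy 1.9 or newer, np.unique can also report how many rows collapsed into each unique key via return_counts=True, which gives the per-row Count the question asks for instead of a single total. A sketch building on the arrays above (the output filename is just an example; note that np.unique sorts, so rows come out ordered by key):
import csv

# numpy >= 1.9: c[i] is the number of rows sharing the i-th unique key
k, m, c = np.unique(p, return_index=True, return_counts=True)
with open('deduped.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['IPAddress', 'Value1', 'Value2', 'Value3', 'Count'])
    # pair each first-occurrence row with its duplicate count
    for v, n in zip(m, c):
        writer.writerow(list(a[v]) + [n])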
Upvotes: 0
Reputation: 1123590
You need a dictionary to track the matches. You do not need a regular expression: only the first 5 characters need to be tracked. Store each row under a key made up of the first column and the first 5 characters of the second, and keep a count alongside it. Count first, then write out the collected rows and counts.
If ordering matters, replace the dictionary with collections.OrderedDict(); otherwise the code is the same:
rows = {}
with open(inputfilename, 'rb') as inputfile:
    # skipinitialspace swallows the spaces after the commas in the sample data,
    # so the [:5] slice sees the value itself rather than a leading space
    reader = csv.reader(inputfile, skipinitialspace=True)
    headers = next(reader)  # collect the first row as headers for the output
    for row in reader:
        key = (row[0], row[1][:5])
        if key not in rows:
            rows[key] = row + [0]
        rows[key][-1] += 1  # count every occurrence, including the first

with open('myfilewithoutduplicates.csv', 'wb') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(headers + ['Count'])
    writer.writerows(rows.itervalues())
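For reference, a minimal sketch of the OrderedDict variant mentioned above: the loop is the same, with dict.setdefault folding the membership test into the counting line (Python 2, as in the answer; inputfilename is the same placeholder):
import csv
from collections import OrderedDict

rows = OrderedDict()  # remembers the order in which keys were first seen
with open(inputfilename, 'rb') as inputfile:
    reader = csv.reader(inputfile, skipinitialspace=True)
    headers = next(reader)
    for row in reader:
        key = (row[0], row[1][:5])
        # create the row with a zero count on first sight, then count every occurrence
        rows.setdefault(key, row + [0])[-1] += 1

with open('myfilewithoutduplicates.csv', 'wb') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(headers + ['Count'])
    writer.writerows(rows.itervalues())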
Upvotes: 1