Reputation: 527
I have a csv file to which both duplicate and unique data get added on a daily basis, so it ends up with a lot of duplicates. I have to remove the duplicates based on specific columns. For example:
csvfile1:
title1 title2 title3 title4 title5
abcdef 12 13 14 15
jklmn 12 13 56 76
abcdef 12 13 98 89
bvnjkl 56 76 86 96
Now, based on title1, title2 and title3 I have to remove the duplicates and add the unique entries to a new csv file. As you can see, the abcdef row is not unique and repeats based on title1, title2 and title3, so it should be removed and the output should look like:
Expected Output CSV File:
title1 title2 title3 title4 title5
jklmn 12 13 56 76
bvnjkl 56 76 86 96
My tried code is below. CSV input file script:
import csv
f = open("1.csv", 'a+')
writer = csv.writer(f)
writer.writerow(("t1", "t2", "t3"))
a = [["a", 'b', 'c'], ["g", "h", "i"], ['a', 'b', 'c']]  # This list changes daily, so new and duplicate data get added daily
for i in range(2):
    writer.writerow((a[i]))
f.close()
Duplicate removal script:
import csv
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate
        out_file.write(line)
My Output: 2.csv:
t1 t2 t3
a b c
g h i
Now, I do not want a b c in 2.csv; based on t1 and t2, I only want the unique g h i.
Upvotes: 1
Views: 1652
Reputation: 90859
Some issues in your code -
In the Python script that creates the csv file, you only iterate up to range(2), and range does not include its end value, so it only writes the first two rows into the csv, not the third one. You can iterate directly over the list, rather than indexing each element. Also, you do not need that many brackets in writer.writerow(). Example -
for i in a:
    writer.writerow(i)
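Putting those fixes together, the script that builds 1.csv could look roughly like this (a sketch, assuming Python 3 and writing the file fresh with "w" rather than appending as in the question):
import csv

a = [["a", "b", "c"], ["g", "h", "i"], ["a", "b", "c"]]  # changes daily, may contain duplicates

with open("1.csv", "w", newline="") as f:   # "w" assumed here; use "a" if you really want to keep appending
    writer = csv.writer(f)
    writer.writerow(["t1", "t2", "t3"])     # header row
    for row in a:                           # iterate over the list itself, not range(2)
        writer.writerow(row)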
In your duplicate removal script, you never actually add anything to seen, so you would never end up removing anything. When you want to remove duplicates based on a subset of the elements of a list, you can just add those elements (in a specific order) to the seen set as a tuple, not a list, since a set only accepts hashable elements. Then, when checking for containment in the set, check only that subset that you added. Example -
import csv
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()
    seentwice = set()
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    rows = []
    # First pass: remember every (row[0], row[1]) key and note which keys occur more than once
    for row in reader:
        if (row[0], row[1]) in seen:
            seentwice.add((row[0], row[1]))
        seen.add((row[0], row[1]))
        rows.append(row)
    # Second pass: write only the rows whose key was never seen twice
    for row in rows:
        if (row[0], row[1]) not in seentwice:
            writer.writerow(row)
This would completely remove any row that is duplicated based on the first and second columns. It would not keep even a single copy of such rows, which I am guessing is what you want.
seen - set - This stores the (row[0], row[1]) keys of rows we have already seen.
seentwice - set - This set is only populated with a key if we encounter a row whose key was already added to seen, which means that row is duplicated.
In the end, we only want to write rows that are not in seentwice, since any row in seentwice is duplicated: there are at least two different rows with the same values at row[0] and row[1].
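To illustrate, here is a minimal in-memory sketch of the same two-pass idea applied to the sample rows from the question (it keeps the (row[0], row[1]) key used in the script above; extend the key to (row[0], row[1], row[2]) to match all three title columns):
rows = [
    ["title1", "title2", "title3", "title4", "title5"],
    ["abcdef", "12", "13", "14", "15"],
    ["jklmn", "12", "13", "56", "76"],
    ["abcdef", "12", "13", "98", "89"],
    ["bvnjkl", "56", "76", "86", "96"],
]

seen = set()
seentwice = set()
for row in rows:
    key = (row[0], row[1])
    if key in seen:
        seentwice.add(key)   # key occurs more than once
    seen.add(key)

for row in rows:
    if (row[0], row[1]) not in seentwice:
        print(row)           # header, jklmn and bvnjkl remain; both abcdef rows are dropped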
Upvotes: 2