Reputation: 527
I have a csv file to which both duplicate and unique data get added on a daily basis, so it ends up with a lot of duplicates. I have to remove the duplicates based on specific columns. For example:
csvfile1:
title1 title2 title3 title4 title5
abcdef 12 13 14 15
jklmn 12 13 56 76
abcdef 12 13 98 89
bvnjkl 56 76 86 96
Now, based on title1, title2 and title3 I have to remove the duplicates and add the unique entries to a new csv file. As you can see, the abcdef row is not unique and repeats based on title1, title2 and title3, so it should be removed and the output should look like:
Expected Output CSV File:
title1 title2 title3 title4 title5
jklmn 12 13 56 76
bvnjkl 56 76 86 96
My tried code is below. CSV input file script:
import csv
f = open("1.csv", 'a+')
writer = csv.writer(f)
writer.writerow(("t1", "t2", "t3"))
a = [["a", 'b', 'c'], ["g", "h", "i"], ['a', 'b', 'c']]  # This list changes daily, so new and duplicate data get added daily
for i in range(2):
    writer.writerow((a[i]))
f.close()
Duplicate removal script:
import csv
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate
        out_file.write(line)
My Output: 2.csv:
t1 t2 t3
a b c
g h i
Now, I do not want a b c in 2.csv; based on t1 and t2, I only want the unique g h i.
Upvotes: 1
Views: 1652
Reputation: 90859
Some issues in your code -
In the Python script that creates the csv file, you only iterate up to range(2), and range does not include its end value, so it only writes the first two rows into the csv, not the third one. You can iterate directly over the list, rather than indexing each element. Also, you do not need that many brackets in writer.writerow(). Example -
for i in a:
    writer.writerow(i)
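Putting those fixes together, the script that builds 1.csv could look roughly like this (a sketch, assuming Python 3 and writing the file fresh with "w" rather than appending as in the question):
import csv

a = [["a", "b", "c"], ["g", "h", "i"], ["a", "b", "c"]]  # changes daily, may contain duplicates

with open("1.csv", "w", newline="") as f:   # "w" assumed here; use "a" if you really want to keep appending
    writer = csv.writer(f)
    writer.writerow(["t1", "t2", "t3"])     # header row
    for row in a:                           # iterate over the list itself, not range(2)
        writer.writerow(row)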
In your duplicate removal script, you never actually add anything to seen, so you would never end up removing anything. When you want to remove duplicates based on a subset of the elements of a list, you can just add those elements (in a specific order) to the seen set as a tuple, not a list, since a set only accepts hashable elements. Then, when checking for containment in the set, check only that subset that you added. Example -
import csv
with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set()
    seentwice = set()
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    rows = []
    # First pass: remember every (row[0], row[1]) key and note which keys occur more than once
    for row in reader:
        if (row[0], row[1]) in seen:
            seentwice.add((row[0], row[1]))
        seen.add((row[0], row[1]))
        rows.append(row)
    # Second pass: write only the rows whose key was never seen twice
    for row in rows:
        if (row[0], row[1]) not in seentwice:
            writer.writerow(row)
This would completely remove any row that is duplicated based on the first and second columns. It would not keep even a single copy of such rows, which I am guessing is what you want.
seen - set - This stores the (row[0], row[1]) keys of rows we have already seen.
seentwice - set - This set is only populated with a key if we encounter a row whose key was already added to seen, which means that row is duplicated.
In the end, we only want to write rows that are not in seentwice, since any row in seentwice is duplicated: there are at least two different rows with the same values at row[0] and row[1].
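To illustrate, here is a minimal in-memory sketch of the same two-pass idea applied to the sample rows from the question (it keeps the (row[0], row[1]) key used in the script above; extend the key to (row[0], row[1], row[2]) to match all three title columns):
rows = [
    ["title1", "title2", "title3", "title4", "title5"],
    ["abcdef", "12", "13", "14", "15"],
    ["jklmn", "12", "13", "56", "76"],
    ["abcdef", "12", "13", "98", "89"],
    ["bvnjkl", "56", "76", "86", "96"],
]

seen = set()
seentwice = set()
for row in rows:
    key = (row[0], row[1])
    if key in seen:
        seentwice.add(key)   # key occurs more than once
    seen.add(key)

for row in rows:
    if (row[0], row[1]) not in seentwice:
        print(row)           # header, jklmn and bvnjkl remain; both abcdef rows are dropped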
Upvotes: 2