user7830303

Python - remove duplicate entries from csv file

I have a big poems.csv file with entries like this:

"
this is a good poem. 
",1

"  
this is a bad poem.    
",0

"
this is a good poem. 
",1

"  
this is a bad poem.    
",0

and I would like to remove duplicates from it:

If the file did not have the binary classifier I could just remove duplicate lines, like this:

with open(data_in,'r') as in_file, open(data_out,'w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate
        seen.add(line)
        out_file.write(line)

But this would remove all classification as well. How do I remove duplicate entries keeping 0s and 1s?

expected output:

"
this is a good poem. 
",1

"  
this is a bad poem.    
",0

Upvotes: 0

Views: 5022

Answers (2)

Mad Physicist

Reputation: 114230

You can easily add both portions of the line to the set. Assuming your "line" consists of a string and an integer (or two strings), a tuple of the two elements is a valid set element: tuples are immutable, and therefore hashable.
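
For instance, a quick illustration (using the sample poem from the question) of why a tuple works where a list would not:

row = ("this is a good poem.", "1")
seen = set()
seen.add(row)        # fine: tuples are hashable
print(row in seen)   # True
# seen.add(["this is a good poem.", "1"]) would raise TypeError: unhashable type: 'list'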

Splitting the line is much easier with the csv.reader class, since it parses each multiline poem and its classifier as a single row.

import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set() # set for fast O(1) amortized lookup
    for row in reader:
        row = tuple(row)
        if row in seen: continue # skip duplicate
        seen.add(row)
        writer.writerow(row)

Since you definitely have a bunch of multiline values in your file, using newline='' is crucial for both input and output because it delegates line splitting to the csv classes.
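
As a self-contained sketch of what that buys you (io.StringIO standing in for a real file):

import csv
import io

data = '"Error 404:\r\nYour Haiku could not be found.",0\r\n'
# newline='' leaves the \r\n inside the quoted field untranslated,
# exactly as open(..., newline='') does for a real file; csv.reader
# then does the record splitting and keeps the two lines as one field.
for row in csv.reader(io.StringIO(data, newline='')):
    print(row)  # ['Error 404:\r\nYour Haiku could not be found.', '0']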

The advantage of doing it this way over using pandas or another library that preloads the entire file is that only one copy of each poem is ever held in memory: the copy retained in the set. For a very large file with many duplicates, this streaming approach is much closer to optimal.

We can test with the following file:

"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1
"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1

The output is as follows:

"Error 404:
Your Haiku could not be found.
Try again later.", 0
"Error 404:
Your Haiku could not be found.
Try again later.", 1

Note about Python 2

The newline argument does not exist in the Python 2 version of open. This is not a problem on most operating systems, since line endings will be internally consistent between the input and output files. Rather than specifying newline='', the Python 2 version of csv requests that files be opened in binary mode.
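
A sketch of the Python 2 equivalent of the loop above, assuming the same data_in and data_out names:

import csv

# Python 2: binary mode replaces newline=''
with open(data_in, 'rb') as in_file, open(data_out, 'wb') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set() # set for fast O(1) amortized lookup
    for row in reader:
        row = tuple(row)
        if row in seen: continue # skip duplicate
        seen.add(row)
        writer.writerow(row)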

UPDATE

You have indicated that the behavior of your own answer is not 100% correct. For your data, full-row deduplication still appears to be a perfectly valid approach, so I am retaining the previous portion of my answer.

To be able to filter only by poem, ignoring (but preserving) the binary classifier of the first occurrence, you don't need to change much in your code:

import csv

with open(data_in, 'r', newline='') as in_file, open(data_out, 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set() # set for fast O(1) amortized lookup
    for row in reader:
        if row[0] in seen: continue # skip duplicate
        seen.add(row[0])
        writer.writerow(row)

Since the zero classifier appears first in the file, the output for the test case above would be:

"Error 404:
Your Haiku could not be found.
Try again later.", 0

I mentioned in the comments that you can also retain either the last seen classifier, or a 1 whenever one is found. For both of these options, I would recommend using a dict (or an OrderedDict if you want to preserve the original order of the poems) keyed by the poem, with the value being the classifier. A dictionary's keys behave essentially like a set. Note that you end up writing the output file only after loading the entire input file.

To retain the last classifier seen:

import csv
from collections import OrderedDict

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    seen = OrderedDict() # map for fast O(1) amortized lookup
    for poem, classifier in reader:
        seen[poem] = classifier # Always update to get the latest classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file) # create the writer only once out_file exists
    for row in seen.items():
        writer.writerow(row)

seen.items() iterates over tuples containing the key (poem) and value (classifier), which happens to be exactly what you want to write to the file.

The output from this version will have the 1 classifier, since that appears last in the test input above:

"Error 404:
Your Haiku could not be found.
Try again later.", 1

A similar approach will work to retain a 1 classifier if it exists:

import csv
from collections import OrderedDict

with open(data_in, 'r', newline='') as in_file:
    reader = csv.reader(in_file)
    seen = OrderedDict() # map for fast O(1) amortized lookup
    for poem, classifier in reader:
        if poem not in seen or classifier == '1':
            seen[poem] = classifier

with open(data_out, 'w', newline='') as out_file:
    writer = csv.writer(out_file) # create the writer only once out_file exists
    for row in seen.items():
        writer.writerow(row)

Upvotes: 1

user7830303

pandas solved it:

import pandas as pd

raw_data = pd.read_csv(data_in)
clean_data = raw_data.drop_duplicates()
clean_data.to_csv(data_out, index=False) # index=False keeps pandas from writing an extra index column
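
Note that drop_duplicates() with no arguments only drops rows where the poem and the classifier are both identical. To deduplicate on the poem text alone, it takes a subset argument; a sketch, assuming the poem is the first column and the file has no header row:

import pandas as pd

raw_data = pd.read_csv(data_in, header=None)      # header=None: the file has no header row
clean_data = raw_data.drop_duplicates(subset=[0]) # compare only the poem column, keep the first row seen
clean_data.to_csv(data_out, header=False, index=False)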

Upvotes: 2
