Vlad

Reputation: 405

Removing duplicates on very large datasets

I'm working with a 13.9 GB CSV file that contains around 16 million rows and 85 columns. I know there are potentially a few hundred thousand duplicate rows. I ran this code to remove them:

import pandas

concatDf = pandas.read_csv("C:\\OUT\\Concat EPC3.csv", low_memory=False)
nodupl = concatDf.drop_duplicates()
nodupl.to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False)

However, this runs into a MemoryError. My RAM is 16 GB and can't go any higher. Is there a more efficient way of removing duplicates, perhaps one that works in chunks, without me having to break the CSV file up into smaller files?
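For example, something along these lines is what I mean by chunks - an untested sketch, where the chunk size of 500,000 is arbitrary and duplicates are tracked across chunks by the 64-bit row hashes from pandas.util.hash_pandas_object (so an unlucky hash collision could in theory drop a unique row):

import pandas

file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

seen = set()         # hashes of rows that have already been written out
first_chunk = True

for chunk in pandas.read_csv(file_in, chunksize=500_000, low_memory=False):
    # one 64-bit hash per row, computed from the values only (not the index)
    row_hashes = pandas.util.hash_pandas_object(chunk, index=False)
    # keep rows that are new across chunks and not repeated within this chunk
    mask = ~row_hashes.isin(seen) & ~chunk.duplicated()
    seen.update(row_hashes[mask])
    chunk[mask].to_csv(file_out, mode="w" if first_chunk else "a",
                       header=first_chunk, index=False)
    first_chunk = False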

Upvotes: 20

Views: 10368

Answers (3)

javidcf

Reputation: 59701

Essentially the same idea as zwer's answer, but checking rows with the same hash for actual equality (instead of automatically discarding every row whose hash has already been seen).

file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    # Skip header
    next(f_in)
    # Find duplicated hashes
    hashes = set()
    hashes_dup = {}
    for row in f_in:
        h = hash(row)
        if h in hashes:
            # hash seen more than once: remember it so the second pass
            # can compare the full text of the rows that share it
            hashes_dup[h] = set()
        else:
            hashes.add(h)
    del hashes
    # Rewind file
    f_in.seek(0)
    # Copy header
    f_out.write(next(f_in))
    # Copy non-repeated lines
    for row in f_in:
        h = hash(row)
        if h in hashes_dup:
            dups = hashes_dup[h]
            if row in dups:
                continue
            dups.add(row)
        f_out.write(row)

Upvotes: 10

zwer

Reputation: 25789

The simplest solution would be to compute a hash for each line in the file - storing 16M hashes in working memory shouldn't be a problem (it depends on the hash size, though) - then, as you iterate over the file, write out only the first occurrence of each hash. You don't even need to parse the CSV, nor do you need Pandas.

import hashlib

with open("input.csv", "r") as f_in, \
        open("output.csv", "w") as f_out:
    seen = set()  # a set to hold our 'visited' lines
    for line in f_in:  # iterate over the input file line by line
        line_hash = hashlib.md5(line.encode()).digest()  # hash the value
        if line_hash not in seen:  # we're seeing this line for the first time
            seen.add(line_hash)  # add it to the hash table
            f_out.write(line)  # write the line to the output

This uses MD5 as the hash, so it takes about 16 bytes plus set overhead per line, but that's still far less than storing everything in memory - you can expect roughly 500 MB of memory usage for a 16M-line CSV file.
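As a quick illustration of that per-line cost (the sample row below is just made up), the raw digest is 16 bytes, and the Python bytes object and set slot around it add some overhead on top:

import hashlib
import sys

digest = hashlib.md5(b"a,made-up,sample,row\n").digest()
print(len(digest))            # 16 -> raw MD5 digest size in bytes
print(sys.getsizeof(digest))  # ~49 on CPython -> digest plus bytes-object overhead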

Upvotes: 14

Dominique

Reputation: 17493

What about a UNIX-like environment (or a port of the UNIX tools to Windows)?

uniq <filename> >outputfile.txt

(something like that - note that uniq only removes adjacent duplicate lines, so an unsorted file would need something like sort <filename> | uniq >outputfile.txt, or sort -u <filename> >outputfile.txt)

Upvotes: 0
