Arian

Reputation: 7719

Removing duplicate lines from a large dataset

Let's assume that I have a very large dataset that cannot fit into memory. There are millions of records in the dataset, and I want to remove duplicate rows (while keeping one row from each group of duplicates).

What's the most efficient approach in terms of space and time complexity?

What I have thought of:

1. Using a Bloom filter. I am not sure exactly how it's implemented, but I guess the side effect is having false positives; in that case, how can we find out whether a record is REALLY a duplicate or not? (See the sketch after this list.)

2. Using hash values. In this case, if we have only a small number of duplicates, the number of unique hash values would be large and we may again have a problem with memory.
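
For point 1, this is roughly the pattern I have in mind: the filter only says "possibly seen before", and a positive is then confirmed against an exact store of the records seen so far. This is just a rough sketch; the filter size, hash count, and the in-memory "exact" set are placeholders for something disk-backed, and I'm not sure it's the most efficient way:

import hashlib

class BloomFilter(object):
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # derive num_hashes bit positions from salted digests of the item
        for seed in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (seed, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely new"; True means "possibly a duplicate"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = BloomFilter()
seen_exactly = set()   # placeholder for an exact, disk-backed store
unique = []
for rec in ["a", "b", "a", "c"]:
    if bloom.might_contain(rec) and rec in seen_exactly:
        continue       # confirmed duplicate, skip it
    bloom.add(rec)
    seen_exactly.add(rec)
    unique.append(rec)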

Upvotes: 3

Views: 2315

Answers (2)

hivert

Reputation: 10657

Your solution 2 (using hash values) doesn't force a memory problem. You just have to partition the hash space into slices that fit into memory. More precisely:

Consider a hash table storing the set of records, where each record is represented only by its index in the table. Say, for example, that such a hash table would be 4GB. Then you split your hash space into k=4 slices. Depending on the last two bits of the hash value, each record goes into one of the slices. So the algorithm would go roughly as follows:

let k = 2^M
for i from 0 to k-1:
    t = new table
    for each record r on the disk:
        h = hashvalue(r)
        if (the last M bits of h == i) {
            insert r into t, keyed by the hash value h >> M
        }
    search t for duplicates and remove them
    delete t from memory

The drawback is that you have to hash each record k times. The advantage is that it can trivially be distributed.

Here is a prototype in Python:

# Fake huge database on the disk
records = ["askdjlsd", "kalsjdld", "alkjdslad", "askdjlsd"]*100

M = 2
mask = 2**M - 1  # selects the last M bits of a hash value

class HashLink(object):
    def __init__(self, idx):
        self._idx = idx
        self._hash = hash(records[idx]) # file access

    # the table is keyed by the remaining (high) bits of the hash
    def __hash__(self):
        return self._hash >> M

    # hash links are equal if they link to equal records
    def __eq__(self, other):
        return records[self._idx] == records[other._idx] # file access

    def __repr__(self):
        return str(records[self._idx])

to_be_deleted = list()
for i in range(2**M):          # one pass per slice
    t = set()                  # in-memory table for the current slice
    for idx, rec in enumerate(records):
        h = hash(rec)
        if h & mask == i:      # the record belongs to slice i
            if HashLink(idx) in t:
                to_be_deleted.append(idx)
            else:
                t.add(HashLink(idx))

The result is:

>>> [records[idx] for idx in range(len(records)) if idx not in to_be_deleted]
['askdjlsd', 'kalsjdld', 'alkjdslad']

Upvotes: 2

Karthikeyan

Reputation: 990

Since you need to delete duplicate items without sorting or indexing, you may end up scanning the entire dataset for every delete, which is unbearably costly in terms of performance. Given that, you may think of some external sorting for this, or a database.

If you don't care about the ordering of the output dataset: create 'n' files, each storing a subset of the input dataset according to the hash of the record or of the record's key. Compute the hash, take it modulo 'n', and that gives you the right output file to store the record in. Since every output file is now small, your delete operation would be very fast; the output file could be a normal file or a sqlite/Berkeley DB, though I would recommend sqlite/bdb. To avoid scanning the output file for every write, you could put a front-end Bloom filter on every output file. Bloom filters aren't that difficult, and lots of libraries are available.

Calculating 'n' depends on your main memory, I would say; go with a pessimistically huge value for 'n'. Once your work is done, concatenate all the output files into a single one.
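
As a rough sketch of this approach with plain text files (the filenames, the value of 'n', and the per-bucket in-memory set standing in for sqlite/bdb are just placeholders):

import os
import zlib

def dedupe_external(input_path, output_path, n=64):
    # Phase 1: scatter every line into one of n bucket files by hash;
    # identical lines hash the same, so all copies land in the same bucket.
    buckets = [open("bucket_%d.tmp" % i, "w") for i in range(n)]
    with open(input_path) as src:
        for line in src:
            b = zlib.crc32(line.encode("utf-8")) % n
            buckets[b].write(line)
    for f in buckets:
        f.close()

    # Phase 2: each bucket is small enough to dedupe in memory;
    # append its unique lines to the final output and drop the bucket.
    with open(output_path, "w") as out:
        for i in range(n):
            seen = set()
            with open("bucket_%d.tmp" % i) as f:
                for line in f:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
            os.remove("bucket_%d.tmp" % i)

dedupe_external("input.txt", "deduped.txt")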

Upvotes: 1
