username

Reputation: 117

find unique values from a large file

I have a large file (say 10 terabytes) containing a stream of MD5 hashes, with duplicates. I am given 10 MB of memory (very limited) and unlimited hard disk space. Find all the unique hashes (eliminating duplicates) under these conditions. Please help; this is obviously not a homework question.

Upvotes: 3

Views: 939

Answers (3)

lsk

Reputation: 532

I like Zim-Zam's solution... proposing a small variation.

If we can assume that the fingerprints are distributed uniformly over the 128-bit space, can we use something like bucket sort to partition the fingerprints into (smaller) bucket files, sort each bucket file individually, and then merge the bucket files into one sorted file using a heap? This might reduce the n log n cost.
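For concreteness, here is a minimal C sketch of the bucketing pass (the file names, the bucket count, and reading raw digests from stdin are my assumptions, not part of the answer):

/* Hypothetical bucketing pass: route each 16-byte MD5 fingerprint to a
 * bucket file keyed on its leading byte.  Uniformity keeps the buckets
 * roughly even-sized. */
#include <stdio.h>
#include <stdlib.h>

#define HASH_LEN  16    /* bytes per MD5 fingerprint */
#define N_BUCKETS 256   /* demo value; for 10 TB you'd need enough buckets
                           (or a second pass) so each fits in the 10 MB budget */

int main(void) {
    FILE *buckets[N_BUCKETS];
    unsigned char hash[HASH_LEN];
    char name[32];

    for (int i = 0; i < N_BUCKETS; i++) {
        snprintf(name, sizeof name, "bucket.%03d", i);
        buckets[i] = fopen(name, "wb");
        if (!buckets[i]) { perror(name); return EXIT_FAILURE; }
    }
    /* One sequential pass over the big file, fanning fingerprints out. */
    while (fread(hash, 1, HASH_LEN, stdin) == HASH_LEN)
        fwrite(hash, 1, HASH_LEN, buckets[hash[0]]);

    for (int i = 0; i < N_BUCKETS; i++)
        fclose(buckets[i]);
    return 0;
}

With enough buckets that each one fits in the 10 MB budget, every bucket can be loaded, sorted with qsort, and deduplicated entirely in memory; since the buckets partition the key space, concatenating them in numeric order yields the full sorted, duplicate-free output.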

Upvotes: 0

jxh

Reputation: 70492

If performance doesn't matter and your file system has no limits, then you can simply create a file for every hash. If, during creation, you encounter EEXIST, then you have a duplicate and can skip it.

for (each hash) {                          /* pseudocode: stream the hashes */
    r = open(hash_to_filename(hash), O_CREAT|O_EXCL|O_WRONLY, 0644);
    if (r < 0) {
        if (errno == EEXIST) continue;     /* seen before: skip the duplicate */
        perror(hash);
        exit(EXIT_FAILURE);
    }
    close(r);
    output(hash);                          /* first occurrence: emit it */
}
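
The answer leaves hash_to_filename unspecified. If the stream already carries hex strings, it can return the hash itself; if it carries raw 16-byte digests, a hex encoder along these (hypothetical) lines would do:

#include <stdio.h>

/* Hypothetical helper: hex-encode a raw 16-byte digest into a portable
 * file name.  Uses a static buffer, so it is not reentrant. */
static const char *hash_to_filename(const unsigned char hash[16]) {
    static char name[2 * 16 + 1];
    for (int i = 0; i < 16; i++)
        snprintf(name + 2 * i, 3, "%02x", hash[i]);
    return name;
}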

The advantage of this approach is that it preserves the order in which each hash value first occurs in the stream.

The actual performance of this solution depends on the file system. If the files are organized in a B-tree, then the performance will be roughly O(N log(N)). If the file system uses a hash table to organize the files, then the performance is expected to be O(N), although it depends on how often collisions occur (and the constant factor is high, because of the disk accesses).

Upvotes: 3

Zim-Zam O'Pootertoot

Reputation: 18148

You can sort the hashes with an external sorting algorithm (e.g. a polyphase merge sort), after which you just need to traverse the file and skip any hash that equals the most recent one:

hash mostRecentHash = null;                      // no hash seen yet
while (fileHasHashes) {
    temp = fileWithDuplicates.readHash();
    // after sorting, duplicates are adjacent, so comparing against
    // the previous hash is enough to drop them
    if (mostRecentHash == null || !hashesAreEqual(mostRecentHash, temp)) {
        mostRecentHash = temp;
        fileWithoutDuplicates.writeHash(mostRecentHash);
    }
}
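
The answer doesn't spell out the external sort itself, so here is a hypothetical C sketch of its run-generation phase: fill the memory budget with hashes, sort them in place, and write each sorted run to its own temporary file. The runs would then be combined with a k-way merge (e.g. the polyphase merge mentioned above); the file names, buffer size, and reading from stdin are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_LEN 16
#define RUN_SIZE (8 * 1024 * 1024 / HASH_LEN)   /* hashes per ~8 MB in-memory run */

static int cmp_hash(const void *a, const void *b) {
    return memcmp(a, b, HASH_LEN);              /* lexicographic 128-bit order */
}

int main(void) {
    unsigned char *buf = malloc((size_t)RUN_SIZE * HASH_LEN);
    if (!buf) return EXIT_FAILURE;

    size_t n;
    int run = 0;
    /* Fill the buffer, sort it, flush it as one sorted run; repeat. */
    while ((n = fread(buf, HASH_LEN, RUN_SIZE, stdin)) > 0) {
        char name[32];
        qsort(buf, n, HASH_LEN, cmp_hash);
        snprintf(name, sizeof name, "run.%06d", run++);
        FILE *out = fopen(name, "wb");
        if (!out) { perror(name); return EXIT_FAILURE; }
        fwrite(buf, HASH_LEN, n, out);
        fclose(out);
    }
    free(buf);
    return 0;
}

After the merge, equal hashes sit next to each other, which is exactly what the traversal above exploits.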

Upvotes: 8
