Reputation: 117
I have a large file (say 10 terabytes) with stream of MD5 hashes (which contains duplicates), I am given a memory of 10MB(very limited) and unlimited hard disc space. Find all the unique hashes(eliminating duplicates) using given conditions. Please help, this is obviously not a homework question
Upvotes: 3
Views: 939
Reputation: 532
I like Zim-Zam's solution...proposing a small variation.
If we can assume that the fingerprints are distributed uniformly over the 128 bit space, then can we use something like Bucket sort to bucket'ize the fingerprints into (smaller) bucket files, sort the bucket files individually and then merge the bucket files into one sorted file using a heap? This might reduce the nlogn cost.
Upvotes: 0
Reputation: 70492
If performance doesn't matter, and your file system has no limits, then you can simply create a file for every hash. If during creation, you encounter EEXIST
, then you have a duplicate, and it can be skipped.
for (each hash) {
r = open(hash_to_filename(hash), O_CREAT|O_EXCL);
if (r < 0) {
if (errno == EEXIST) continue;
perror(hash);
exit(EXIT_FAILURE);
}
close(r);
output(hash);
}
The advantage of this is that it preserves the order of the hash values first occurrence in the stream.
The actual performance of this solution depends on the performance of the file system. If the files are organized in a B-Tree, then the performance will be roughly O(N log(N)). If the file system uses a hash table to organize the files, then the performance is expected to be O(N), but it depends on how often collisions occur (and the constant factor is high, because of the disk access).
Upvotes: 3
Reputation: 18148
You can sort the hashes with an external sorting algorithm (e.g. with a polyphase merge sort), after which you just need to traverse the file and skip any hashes that equal the most recent hash
hash mostRecentHash;
while(fileHasHashes) {
temp = fileWithDuplicates.readHash();
if(!hashesAreEqual(mostRecentHash, temp)) {
mostRecentHash = temp;
fileWithoutDuplicates.writeHash(mostRecentHash);
}
}
Upvotes: 8