Michael

Reputation: 42100

How to remove duplicates from a file?

How do you remove duplicates from a large file of large numbers? This is an interview question about algorithms and data structures, rather than about sort -u and the like.

I assume that the file does not fit in memory and that the range of the numbers is large enough that I cannot use an in-memory counting/bucket sort.

The only option I see is to sort the file (e.g. with an external merge sort) and then pass over the sorted file again to filter out duplicates.
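Roughly, the second pass I have in mind would look like this (just a sketch in Python, assuming one number per line and that the external sort has already produced the sorted file; the file names are placeholders):

    def drop_adjacent_duplicates(sorted_in="sorted.txt", unique_out="unique.txt"):
        # After an external sort, equal numbers sit on adjacent lines,
        # so keeping only the first line of each run removes all duplicates.
        with open(sorted_in) as src, open(unique_out, "w") as dst:
            prev = None
            for line in src:
                if line != prev:
                    dst.write(line)
                    prev = line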

Does this make sense? Are there other options?

Upvotes: 1

Views: 344

Answers (3)

user1277476

Reputation: 2909

Mergesort or Timsort (which is an improved mergesort) is a good idea. EG: http://stromberg.dnsalias.org/~strombrg/sort-comparison/

You might also be able to get some mileage out of a bloom filter. It's a probabilistic data structure with low memory requirements, and you can tune its error probability. EG: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/ You could use one to toss out values that are definitely unique, and then scrutinize the values that are probably not unique more closely via some other method. This would be especially valuable if your input dataset has a lot of duplicates. It doesn't require comparing elements directly; it just hashes each element using a potentially large number of hash functions.
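Something along these lines, for example (just a sketch; the bit-array size and hash count are made up, and the point is that the bit array fits in RAM even when the file doesn't):

    import hashlib

    class BloomFilter:
        # Minimal Bloom filter: m bits, k hash positions derived from blake2b.
        def __init__(self, m_bits, k):
            self.m, self.k = m_bits, k
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, item):
            for seed in range(self.k):
                h = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "little"))
                yield int.from_bytes(h.digest()[:8], "little") % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def find_duplicate_candidates(path, m_bits=2_000_000_000, k=7):
        # First pass: a number the filter has definitely not seen is unique so far;
        # "maybe seen" numbers are only candidates and need an exact re-check later.
        bloom = BloomFilter(m_bits, k)
        candidates = []
        with open(path) as f:
            for line in f:
                n = line.strip()
                if bloom.might_contain(n):
                    candidates.append(n)  # possible duplicate, or a false positive
                else:
                    bloom.add(n)
        return candidates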

You could also use an on-disk BTree or 2-3 Tree or similar. These are stored on disk and keep key/value pairs in key order.
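For instance, SQLite stores its tables as on-disk B-trees, so a sketch like this keeps the set of seen numbers out of RAM entirely (assuming the numbers fit in 64 bits; file names are made up):

    import sqlite3

    def dedup_with_sqlite(in_path, out_path, db_path="seen.sqlite"):
        con = sqlite3.connect(db_path)
        # The table, keyed on the number itself, lives in an on-disk B-tree.
        con.execute("CREATE TABLE IF NOT EXISTS nums (n INTEGER PRIMARY KEY)")
        with open(in_path) as f:
            con.executemany("INSERT OR IGNORE INTO nums VALUES (?)",
                            ((int(line),) for line in f))
        con.commit()
        with open(out_path, "w") as out:
            # The B-tree hands the keys back in sorted order.
            for (n,) in con.execute("SELECT n FROM nums ORDER BY n"):
                out.write(f"{n}\n")
        con.close()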

Upvotes: 1

Will Ness

Reputation: 71109

You won't even need a separate pass over the sorted data if you use a duplicates-removing variant of "merge" (a.k.a. "union") in your mergesort. A hash table, on the other hand, should be mostly empty to perform well, i.e. be even bigger than the file itself - and we're told that the file itself is big.

Look up multi-way merge and external sorting.
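A sketch of that duplicates-removing merge ("union") of two sorted runs; the multi-way version does the same thing across all runs at once:

    def union_merge(run_a, run_b):
        # Merge two sorted iterators, emitting each value only once, so the
        # duplicates disappear inside the merge itself - no extra pass needed.
        end = object()
        a, b = next(run_a, end), next(run_b, end)
        last = object()
        while a is not end or b is not end:
            if b is end or (a is not end and a <= b):
                value, a = a, next(run_a, end)
            else:
                value, b = b, next(run_b, end)
            if value != last:
                yield value
                last = value

Used as the merge step at every level of the mergesort (or across all sorted runs in the external, multi-way case), this makes the final output duplicate-free as it is produced.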

Upvotes: 3

amit

Reputation: 178491

Yes, the solution makes sense.

An alternative is to build a file-system-based hash table and maintain it as a set. First iterate over all elements and insert them into your set; later, in a second iteration, print all elements in the set.
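A sketch of that idea: hash-partition the numbers into bucket files on disk, so every copy of a given number lands in the same bucket, and each bucket is small enough to deduplicate with an in-memory set (bucket count and paths are made up):

    import os

    def dedup_with_disk_buckets(in_path, out_path, n_buckets=256, tmp_dir="buckets"):
        os.makedirs(tmp_dir, exist_ok=True)
        bucket_paths = [os.path.join(tmp_dir, f"b{i}") for i in range(n_buckets)]
        buckets = [open(p, "w") for p in bucket_paths]
        with open(in_path) as f:
            for line in f:
                n = int(line)
                # All occurrences of n go to the same bucket file.
                buckets[hash(n) % n_buckets].write(f"{n}\n")
        for b in buckets:
            b.close()

        with open(out_path, "w") as out:
            for p in bucket_paths:
                with open(p) as b:
                    unique = {int(line) for line in b}  # per-bucket in-memory set
                out.writelines(f"{n}\n" for n in unique)
                os.remove(p)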

Which one performs better is implementation- and data-dependent. In terms of big-O complexity, the hash table offers O(n) average-case time and O(n^2) worst case, while the merge-sort option offers a more stable O(n log n) solution.

Upvotes: 2
