Piotr Müller

Reputation: 5548

Fastest way to remove duplicate lines in very large .txt files

What is the best way to remove duplicate lines from large .txt files (1 GB and larger)?

Because removing adjacent duplicates is simple, the problem can be reduced to just sorting the file.

Assume that we can't load the whole data into RAM because of its size.
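To make the sort-based reduction concrete, here is a minimal external merge-sort sketch: sort chunks that fit in RAM, spill them to temporary files, then do one merge pass that skips adjacent duplicates. The chunk size and file handling are illustrative assumptions, not a tuned implementation.

```python
import heapq
import itertools
import os
import tempfile

CHUNK_LINES = 1_000_000  # lines per in-memory chunk; tune to available RAM

def dedupe_by_external_sort(src_path, dst_path):
    chunk_paths = []
    # Phase 1: read the input in RAM-sized chunks, sort each chunk,
    # and spill it to a temporary file.
    with open(src_path, "r", encoding="utf-8") as src:
        while True:
            chunk = list(itertools.islice(src, CHUNK_LINES))
            if not chunk:
                break
            if not chunk[-1].endswith("\n"):
                chunk[-1] += "\n"  # normalise so every record ends with a newline
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                tmp.writelines(chunk)
            chunk_paths.append(path)

    # Phase 2: k-way merge of the sorted chunks; the merged stream is
    # globally sorted, so duplicates are adjacent and easy to skip.
    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        with open(dst_path, "w", encoding="utf-8") as dst:
            prev = None
            for line in heapq.merge(*files):
                if line != prev:
                    dst.write(line)
                    prev = line
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)
```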

Right now I'm waiting to retrieve all records from an SQL table with one unique indexed field (I loaded the file's lines into the table earlier), and I'm wondering whether there's a way to speed this up.

Upvotes: 4

Views: 3675

Answers (1)

Paul Rubel

Reputation: 27222

You could try a Bloom filter. While you may get some false positives (though you can get arbitrarily close to 0% at the cost of more processing), it should be pretty fast, as you don't need to compare or even do a log(n) search for each line you see.
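A minimal sketch of that idea in Python, streaming the file once and preserving line order. The filter size, hash count, and the double-hashing scheme below are illustrative assumptions, not a specific library API.

```python
import hashlib

M = 8 * 1024 * 1024 * 64   # number of bits in the filter (~64 MB of RAM)
K = 7                      # number of hash functions per line

def _positions(line):
    # Derive K bit positions from one SHA-256 digest via double hashing.
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % M for i in range(K)]

def dedupe_with_bloom(src_path, dst_path):
    bits = bytearray(M // 8)
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            seen = True
            for pos in _positions(line):
                byte, bit = divmod(pos, 8)
                if not bits[byte] & (1 << bit):
                    seen = False          # at least one bit unset: line is new
                    bits[byte] |= 1 << bit
            if not seen:
                dst.write(line)           # write only lines not seen before
```

Note that a false positive here means a genuinely unique line is silently dropped; you can shrink that risk by enlarging the filter, or treat the filter only as a pre-check and verify suspected duplicates exactly (e.g. against the already-written output).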

Upvotes: 2
