Reputation: 5548
What is the best way to remove duplicate lines from large .txt files (1 GB or more)?
Since removing adjacent duplicates is simple, the problem can be reduced to just sorting the file.
Assume that we can't load the whole file into RAM because of its size.
At the moment I'm waiting to retrieve all the records from a SQL table with a unique index on the line field (I loaded the file lines into the table earlier), and I'm wondering whether there is a way to speed this up.
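For reference, here is a minimal sketch of the sort-based idea described above: split the file into chunks that fit in RAM, sort each chunk into a temporary run, then merge the runs while dropping adjacent duplicates. The file names and chunk size are illustrative assumptions, not anything from my actual setup.

```python
import heapq
import tempfile

def sorted_runs(path, chunk_bytes=64 * 1024 * 1024):
    """Yield temp files, each holding one sorted chunk of the input."""
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        while True:
            # readlines(hint) reads roughly chunk_bytes worth of complete lines
            chunk = src.readlines(chunk_bytes)
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
            run.writelines(chunk)
            run.seek(0)
            yield run

def dedupe(src_path, dst_path):
    runs = list(sorted_runs(src_path))
    with open(dst_path, "w", encoding="utf-8") as dst:
        previous = None
        # heapq.merge lazily merges the already-sorted runs,
        # so duplicates come out adjacent and are easy to skip
        for line in heapq.merge(*runs):
            if line != previous:
                dst.write(line)
                previous = line
    for run in runs:
        run.close()

# hypothetical file names for illustration
dedupe("input.txt", "deduped.txt")
```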
Upvotes: 4
Views: 3675
Reputation: 27222
You could try a Bloom filter. While you may get some false positives (though you can get arbitrarily close to 0% at the cost of more processing), it should be pretty fast, since you don't need to compare or even do a log(n) lookup for each line you see.
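A rough sketch of the idea, assuming a hand-rolled filter (the bit-array size, hash count, and file names below are illustrative, and a false positive means a unique line may occasionally be dropped):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 128 * 1024 * 1024, num_hashes=7):
        self.size = size_bits                  # a 128 MB bit array
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # derive several bit positions from one SHA-256 digest
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def dedupe_with_bloom(src_path, dst_path):
    seen = BloomFilter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            if line not in seen:   # probably a new line: keep it
                seen.add(line)
                dst.write(line)

# hypothetical file names for illustration
dedupe_with_bloom("input.txt", "deduped.txt")
```

This streams the file once and never holds more than the bit array in memory; the trade-off is that the result is probabilistic rather than exact.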
Upvotes: 2