Reputation: 5548
What is the best way to remove duplicate lines from large .txt files (1 GB or more)?
Since removing adjacent duplicates is simple, the problem can be reduced to just sorting the file.
Assume that we can't load the whole file into RAM because of its size.
At the moment I'm waiting to retrieve all the records from a SQL table with a unique index on the line field (I loaded the file lines into the table earlier), and I'm wondering whether there is a way to speed this up.
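For reference, here is a minimal sketch of the sort-based idea described above: split the file into chunks that fit in RAM, sort each chunk into a temporary run, then merge the runs while dropping adjacent duplicates. The file names and chunk size are illustrative assumptions, not anything from my actual setup.

```python
import heapq
import tempfile

def sorted_runs(path, chunk_bytes=64 * 1024 * 1024):
    """Yield temp files, each holding one sorted chunk of the input."""
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        while True:
            # readlines(hint) reads roughly chunk_bytes worth of complete lines
            chunk = src.readlines(chunk_bytes)
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
            run.writelines(chunk)
            run.seek(0)
            yield run

def dedupe(src_path, dst_path):
    runs = list(sorted_runs(src_path))
    with open(dst_path, "w", encoding="utf-8") as dst:
        previous = None
        # heapq.merge lazily merges the already-sorted runs,
        # so duplicates come out adjacent and are easy to skip
        for line in heapq.merge(*runs):
            if line != previous:
                dst.write(line)
                previous = line
    for run in runs:
        run.close()

# hypothetical file names for illustration
dedupe("input.txt", "deduped.txt")
```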
Upvotes: 4
Views: 3675
Reputation: 27222
You could try a Bloom filter. While you may get some false positives (though you can get arbitrarily close to 0% at the cost of more processing), it should be pretty fast, since you don't need to compare or even do a log(n) lookup for each line you see.
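A rough sketch of the idea, assuming a hand-rolled filter (the bit-array size, hash count, and file names below are illustrative, and a false positive means a unique line may occasionally be dropped):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=8 * 128 * 1024 * 1024, num_hashes=7):
        self.size = size_bits                  # a 128 MB bit array
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # derive several bit positions from one SHA-256 digest
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def dedupe_with_bloom(src_path, dst_path):
    seen = BloomFilter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            if line not in seen:   # probably a new line: keep it
                seen.add(line)
                dst.write(line)

# hypothetical file names for illustration
dedupe_with_bloom("input.txt", "deduped.txt")
```

This streams the file once and never holds more than the bit array in memory; the trade-off is that the result is probabilistic rather than exact.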
Upvotes: 2