E K

Reputation: 129

How can I find & delete duplicate lines from ~800 GB of text files?

I have a dataset of ~800 GB of text files, about 50k .txt files in total.

I'd like to go through them and build a single master .txt file with all duplicate lines removed.

I can't find a way to do this that won't take months for my computer to process; ideally I'd like to keep it under a week.

Upvotes: 0

Views: 24

Answers (1)

Amadan

Reputation: 198456

sort -u <data.txt >clean.txt

All you need is a large disk.

sort is quite efficient: it will automatically split the input into manageable chunks, sort each one separately, then merge them (which can be done in O(N) time); and while merging, it will discard the duplicates (thanks to the -u option). But you will need at least enough space for the output file, plus space for all the intermediate files.
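Since the dataset is ~50k separate .txt files rather than a single data.txt, here is a minimal sketch of one way to stream them all through a single sort -u pass. The dataset path, temp directory, buffer size, and thread count below are assumptions to adjust for your machine; LC_ALL=C forces byte-wise comparison, which is usually much faster than locale-aware sorting.

# Stream every .txt file into one deduplicating sort (paths and sizes are placeholders).
find /path/to/dataset -name '*.txt' -print0 \
  | xargs -0 cat \
  | LC_ALL=C sort -u -S 8G --parallel=4 -T /path/to/big/tmp -o master.txt

Point -T at a disk with enough free space for the intermediate files (roughly the size of the input), as noted above.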

Upvotes: 1
