E K

Reputation: 129

How can I find & delete duplicate lines from ~800 GB of text files?

I have a dataset of ~800 GB of text files, about 50k .txt files in total.

I'd like to go through them and build a single master .txt file with all duplicate lines removed.

I can't find a way to do this that won't take months for my computer to process; ideally I'd like to keep it under a week.

Upvotes: 0

Views: 24

Answers (1)

Amadan

Reputation: 198456

sort -u <data.txt >clean.txt

All you need is a large disk.

sort is quite efficient: it will automatically split the input into manageable chunks, sort each one separately, then merge them (which can be done in O(N) time); and while merging, it will discard the duplicates (thanks to the -u option). But you will need at least enough space for the output file, plus space for all the intermediate files.
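Since the dataset is ~50k separate .txt files rather than a single data.txt, here is a minimal sketch of one way to stream them all through a single sort -u pass. The dataset path, temp directory, buffer size, and thread count below are assumptions to adjust for your machine; LC_ALL=C forces byte-wise comparison, which is usually much faster than locale-aware sorting.

# Stream every .txt file into one deduplicating sort (paths and sizes are placeholders).
find /path/to/dataset -name '*.txt' -print0 \
  | xargs -0 cat \
  | LC_ALL=C sort -u -S 8G --parallel=4 -T /path/to/big/tmp -o master.txt

Point -T at a disk with enough free space for the intermediate files (roughly the size of the input), as noted above.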

Upvotes: 1
