Reputation: 129
I have a dataset of ~800 GB of text files, about 50k .txt files in total.
I'd like to go through them and build a master .txt file, with all duplicate lines removed across every file.
I can't find a way to do this that won't take my computer months to process; ideally I'd like to keep it under a week.
Upvotes: 0
Views: 24
Reputation: 198456
sort -u <data.txt >clean.txt
All you need is a large disk.
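Since you have ~50k separate .txt files rather than a single data.txt, a minimal sketch of feeding them all through the same pipeline (the /path/to/txt directory is an assumption; find with -exec cat {} + avoids the argument-length limit you could hit with a plain *.txt glob):

find /path/to/txt -name '*.txt' -exec cat {} + | sort -u -o master.txt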
sort is quite efficient: it will automatically split the input into manageable chunks, sort each one separately, then merge them (merging k sorted chunks takes O(N log k) time, close to linear); and while merging, it will discard the duplicates (due to the -u option). But you will need at least the space for the output file, plus the space for all the intermediate files.
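If you are using GNU coreutils sort, a few options can cut the run time considerably. This is only a sketch with assumed paths and sizes, not a tested recipe: -T points the intermediate files at a disk with enough free space, -S raises the in-memory buffer, --parallel uses several cores for the in-memory sorting passes, and LC_ALL=C switches to plain byte comparison, which is much faster than locale-aware collation.

find /path/to/txt -name '*.txt' -exec cat {} + | LC_ALL=C sort -u -T /big/scratch -S 60% --parallel=4 -o master.txt

Note that with LC_ALL=C the master file comes out in byte order rather than locale order; for deduplication that usually doesn't matter.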
Upvotes: 1