Reputation: 3719
I have multiple files like this:
file1:
item1
item2
item3
file2:
item1
item5
item3
file3:
item2
item1
item4
I want to have a file with all the unique elements. I could do that with Python; the only problem is that each file contains several million lines, and I wanted to know if there is a better method (maybe using only shell scripts?).
Upvotes: 1
Views: 294
Reputation: 96994
If you want the elements in common between all three files, another approach is to use a few grep operations:
$ grep -F -f file1 file2 > file1inFile2
$ grep -F -f file1 file3 > file1inFile3
$ grep -F -f file1inFile2 file1inFile3 > elementsInCommon
The -f option specifies searching against a file of patterns (file1 and file1inFile2 in this case). The -F option does a fixed-string search.
If you use bash, you can do a fancy one-liner:
$ grep -F -f <(grep -F -f file1 file2) <(grep -F -f file1 file3) > elementsInCommon
Grep searches in sublinear time, I think, so this may get around the usual O(n log n) time cost of pre-sorting very large files with the sort | uniq approach.
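For comparison, the sort-based route would look something like the following sketch, assuming no item repeats within a single file (so a count of 3 means the item occurs in all three files) and that items contain no whitespace:
$ sort file1 file2 file3 | uniq -c | awk '$1 == 3 { print $2 }' > elementsInCommon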
You might be able to speed up a fixed-string grep operation even further by setting the LC_ALL=C environment variable. However, when I explored this, it seemed to already be my shell's default. Still, given the time improvement that is reported, this setting seems worth investigating if you use grep.
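If it is not already your default, it can be set per invocation, for example:
$ LC_ALL=C grep -F -f file1 file2 > file1inFile2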
Grep may use a fair amount of memory loading patterns, though, which could be an issue given the size of your input files. You might use your smallest of the three files as the pattern source.
If your inputs are already sorted, however, you can walk through each file one line at a time, testing string equality between the three current lines. You then either advance some of the file pointers by a line, or print the string that is common to all three inputs. This approach uses O(n) time (you walk through each file once) and O(1) memory (you buffer three lines): more time, but much less memory. I'm not sure whether this can be done with bash built-ins or core utilities alone, but it is definitely doable with Python, Perl, C, etc.
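A minimal Python sketch of that walk, assuming the three inputs are already sorted, newline-terminated, and free of duplicates within a single file (the file names are just placeholders):
def intersect_sorted(path1, path2, path3, out_path):
    with open(path1) as f1, open(path2) as f2, open(path3) as f3, open(out_path, "w") as out:
        a, b, c = f1.readline(), f2.readline(), f3.readline()
        while a and b and c:                # readline() returns "" at EOF
            if a == b == c:                 # common element: emit it and advance all three
                out.write(a)
                a, b, c = f1.readline(), f2.readline(), f3.readline()
            else:
                low = min(a, b, c)          # advance only the file(s) holding the smallest line
                if a == low: a = f1.readline()
                if b == low: b = f2.readline()
                if c == low: c = f3.readline()

intersect_sorted("file1", "file2", "file3", "elementsInCommon")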
Upvotes: 2
Reputation:
How about:
cat * | uniq
or, if each file contains repeats within itself, there may be efficiency gains with:
for file in *; do cat "$file" | uniq; done | uniq
If the files aren't sorted, uniq doesn't work (it only removes adjacent duplicates), so this may not be more efficient, as you will need:
for file in *; do sort "$file" | uniq; done | sort | uniq
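For what it's worth, sort -u (equivalent to sort | uniq) collapses this into a single command; the output file name below is just a placeholder:
sort -u file1 file2 file3 > unique_items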
Upvotes: 5