meto

Reputation: 3719

Set union of elements in different files

I have multiple files like this:

file1:

item1
item2
item3

file2:

item1
item5
item3

file3:

item2
item1
item4

I want to have a file with all the unique elements. I could do that with Python; the only problem is that each file contains several million lines, so I wanted to know if there is a better method (maybe using only shell scripts?).

Upvotes: 1

Views: 294

Answers (2)

Alex Reynolds

Reputation: 96994

If you want the elements in common between all three files, another approach is to use a few grep operations:

$ grep -F -f file1 file2 > file1inFile2
$ grep -F -f file1 file3 > file1inFile3
$ grep -F -f file1inFile2 file1inFile3 > elementsInCommon

The -f option tells grep to read its search patterns from a file (file1 and file1inFile2 in this case). The -F option makes it a fixed-string search rather than a regular-expression search.
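
One caveat with -F is that it matches a pattern anywhere within a line, so item1 would also match a line like item10. If your items should only match as complete lines, adding the -x option (match whole lines only) should tighten the comparison, e.g. for the first command above:

$ grep -Fx -f file1 file2 > file1inFile2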

If you use bash, you can do the whole thing as a fancy one-liner with process substitution:

$ grep -F -f <(grep -F -f file1 file2) <(grep -F -f file1 file3) > elementsInCommon

Grep searches in sublinear time, I think, so this may get around the usual O(n log n) cost of presorting very large files that the sort | uniq approach requires.

You might be able to speed up a fixed-string grep operation even further by setting the LC_ALL=C environment variable. However, when I explored this it seemed to already be the shell default. Still, given the time improvements others have reported, this setting seems worth investigating if you use grep.
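
For example, the variable can be set for just a single grep invocation by prefixing the command, which forces byte-wise (C locale) comparison without changing your session's locale:

$ LC_ALL=C grep -F -f file1 file2 > file1inFile2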

Grep may use a fair amount of memory loading the patterns, though, which could be an issue given the size of your input files. You might use the smallest of the three files as the pattern source.
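
For instance, if file3 happened to be the smallest (a hypothetical assumption here), you could put it in the pattern position so grep only ever loads its lines, and then the intermediate intersection, as patterns:

$ grep -F -f file3 file1 > file3inFile1
$ grep -F -f file3inFile1 file2 > elementsInCommon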

If your inputs are already sorted, however, you can walk through the files one line at a time, testing string equality between the three current lines. At each step you either advance one or more of the file pointers by a line, or print the string that is common to all three inputs. This approach uses O(n) time (you walk through each file once) and O(1) memory (you only ever buffer three lines). More time, but much less memory. I'm not sure whether this can be done with bash built-ins alone, but it is definitely doable with Python, Perl, C, etc.
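
Among the core utilities, comm does perform this kind of merge comparison on sorted input. It is pairwise rather than three-way, but it can be chained (a sketch, assuming file1, file2, and file3 are each already sorted, and noting that comm compares whole lines rather than substrings):

$ comm -12 file1 file2 | comm -12 - file3 > elementsInCommon

The -12 flags suppress the lines unique to either input, leaving only the lines common to both; the second comm reads the first intersection on standard input.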

Upvotes: 2

user798182

How about:

cat * | uniq

or there may be efficiency gains if each file contains repeats within itself:

for file in *; do cat "$file" | uniq; done | uniq

If the files aren't sorted, uniq doesn't work (it only removes adjacent duplicate lines), so this may not be more efficient after all, as you will need:

for file in *; do sort "$file" | uniq; done | sort | uniq
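
Since sort accepts multiple input files and its -u flag drops duplicate lines itself, the whole thing can likely be collapsed into a single command:

sort -u file1 file2 file3 > union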

Upvotes: 5
