Jonovono

Reputation: 2085

Bash: Get the intersection of multiple files

So let me explain this a bit more:

I have a directory called tags that has a file for each tag, something like:

tags/
    t1
    t2
    t3

In each of the tag files is a structure like:

<inode> <filename> <filepath>

Of course, each tag file will have a list of many files with that tag (but a given file can appear in a given tag file only once). And a file may appear in multiple tag files.
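
For example, tags/t1 might contain lines like (the values here are just made up for illustration):

    1054321 report.pdf /home/user/docs/report.pdf
    1060442 photo.jpg /home/user/pictures/photo.jpg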

What I want to be able to do is call a command like

tags <t1> <t2> 

and have it list, in a nice way, the files that have BOTH tags t1 and t2.

My plan right now is to make a temp file: basically dump the entire contents of t1 into it, then run through each line in t2 and do an awk on the temp file, and just keep doing that for each additional tag.

But I am wondering if anyone has other ways of doing this; I am not overly familiar with awk, grep, etc.
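
For reference, a rough sketch of what I have in mind (untested, just to illustrate the idea; it uses grep -xF for the whole-line filtering instead of awk):

    #!/bin/bash
    # tags: print the entries common to all of the given tag files.
    # Start from the first tag file, then repeatedly keep only those
    # lines that also appear (as whole lines) in each further tag file.
    tmp=$(mktemp)
    trap 'rm -f "$tmp" "$tmp.next"' EXIT

    cp "tags/$1" "$tmp"
    shift
    for t in "$@"; do
        grep -xF -f "tags/$t" "$tmp" > "$tmp.next"
        mv "$tmp.next" "$tmp"
    done
    cat "$tmp"

Called as tags t1 t2, this would print the lines present in both tags/t1 and tags/t2.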

Upvotes: 18

Views: 12158

Answers (4)

Krister Janmore

Reputation: 95

Using awk, it's quite easy to create a single-command solution that works for an arbitrary number of unsorted files. For large files, it can be much quicker than using sort and pipes, as I show below. By changing $0 to $1 etc., you can also find the intersection of specific columns (a small sketch of this follows the explanation for #1 below).

I've included 3 solutions: a simple one that does not handle duplicated lines within files; a more complicated one that does handle them; and an even more complicated one that also handles them and is (over-)engineered for performance. Solutions #1 and #2 assume a version of awk that has the FNR variable, and solution #3 requires gawk's ENDFILE (although this can be circumvented by using FNR == 1 instead and rearranging some logic).


Solution #1 (does not handle duplicated lines within files):

awk ' FNR == 1 { b++ } { a[$0]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' \
    t1 t2 t3

Solution #2 (handles duplicated lines within files):

awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
    t1 t2 t3

Solution #3 (performant, handles duplicates within files, but complex, and as written relies on gawk's ENDFILE):

awk ' b == 0 { a[$0] = 0 ; next } 
      $0 in a { a[$0] = 1 } 
      ENDFILE { 
          if (b == 0) { b = 1 } 
          else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } } 
      }
      END { for (i in a) { print i } } ' \
    t1 t2 t3

Explanation for #1:

FNR == 1 { b++ }              # when awk reads the first line of a new file, FNR resets 
                              # to 1. every time FNR == 1, we increment a counter 
                              # variable b. 
                              # this counts the number of input files.

{ a[$0]++ }                   # on every line in every file, take the whole line ( $0 ), 
                              # use it as a key in the array a, and increase the value 
                              # of a[$0] by 1.
                              # this counts the number of observations of line $0 across 
                              # all input files.

END { ... }                   # after reading the last line of the last file...

for (i in a) { ... }          # ... loop over the keys of array a ...

if (a[i] == b) { ... }        # ... and if the value at that key is equal to the number 
                              # of input files...

print i                       # ... we print the key - i.e. the line.
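
To make the "$0 to $1" remark above concrete: keying the array on $1 instead of the whole line intersects on the first column only (the inode, in the question's tag files). A sketch, with the same caveat as #1 about duplicated keys within a file; note that it prints only the key column:

awk ' FNR == 1 { b++ } { a[$1]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' \
    t1 t2 t3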

Explanation for #2:

c[$0] == 0 { a[$0]++ ; c[$0] = 1 }  # as above, but now we include an array c that 
                                    # indicates if we've seen lines *within* each file.
                                    # if we haven't seen the line before in this file, we 
                                    # increment the count at that line(/key) in array a. 
                                    # we also set the value at that key in array c to 1 
                                    # to note that we've now seen it in this file before.

FNR == 1 { b++ ; delete c }         # as previous solution, but now we also clear the 
                                    # array c between files.

Explanation for #3:

This post is already quite long, so I won't do a line-by-line walkthrough for this solution. But in short:

1) build an array a whose keys are the lines of the first file, with every value set to 0;

2) on subsequent files, whenever a line is already a key in a, set its value to 1;

3) at the end of each file, delete every key in a whose value is still 0 (meaning it wasn't seen in that file), and reset the remaining values to 0;

4) after all files have been read, print every key left in a.

We get a good speed-up here because instead of keeping and searching through an array of every single line seen so far, we only keep an array of the lines that are in the intersection of all previous files, which (usually!) shrinks with each new file.
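
For awks without ENDFILE, the logic of #3 can be rearranged around FNR == 1, as mentioned above. One possible rearrangement (a sketch, not benchmarked; unlike the ENDFILE version, it silently skips completely empty input files):

awk ' FNR == 1 { nf++ }
      FNR == 1 && nf > 2 {
          # apply the previous file's results before reading this one:
          # drop lines it didn't contain, mark the rest as "not yet seen"
          for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } }
      }
      nf == 1 { a[$0] = 0 ; next }
      $0 in a { a[$0] = 1 }
      END { for (i in a) { if (a[i] == 1 || nf == 1) { print i } } } ' \
    t1 t2 t3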


Benchmarking:

Note: the improvement in runtime appears to get more significant as the lines within files get longer.

### Create test data with *no duplicated lines within files*

mkdir test_dir; cd test_dir

for i in {1..30}; do shuf -i 1-540000 -n 500000 > test_no_dups${i}.txt; done

### Solution #0: based on sort and uniq

time sort test_no_dups*.txt | uniq -c | sed -n 's/^ *30 //p' > intersect_no_dups.txt

# real    0m12.982s
# user    0m51.594s
# sys     0m3.250s

wc -l < intersect_no_dups.txt # 53772

### Solution #1:

time \
awk ' FNR == 1 { b++ }
      { a[$0]++ } 
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
    test_no_dups*.txt \
  > intersect_no_dups.txt

# real    0m8.048s
# user    0m7.484s
# sys     0m0.313s

wc -l < intersect_no_dups.txt # 53772

### Solution #2:

time \
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
    test_no_dups*.txt \
  > intersect_no_dups.txt

# real    0m14.965s
# user    0m14.688s
# sys     0m0.297s

wc -l < intersect_no_dups.txt # 53772

### Solution #3:

time \
awk ' b == 0 { a[$0] = 0 ; next } 
      $0 in a { a[$0] = 1 } 
      ENDFILE { 
          if (b == 0) { b = 1 } 
          else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } } 
      }
      END { for (i in a) { print i } } ' \
      test_no_dups*.txt \
  > intersect_no_dups.txt

# real    0m5.929s
# user    0m5.672s
# sys     0m0.250s

wc -l < intersect_no_dups.txt # 53772

And if files can contain duplicates:

### Create test data containing repeated lines (-r: sample w/ replacement)

for i in {1..30} ; do
    shuf -r -i 1-150000 -n 500000 > test_dups${i}.txt
done


### Solution #0: based on sort and uniq

time \
for i in test_dups*.txt ; do
    sort -u "$i"
done \
| sort \
| uniq -c \
| sed -n 's/^ *30 //p' \
> intersect_dups.txt

# real   0m13.503s
# user   0m26.688s
# sys    0m2.297s

wc -l < intersect_dups.txt # 50389

### [Solution #1 won't work here]

### Solution #2:

# note: `delete c` can be replaced with `split("", c)`
time \
awk ' FNR == 1 { b++ ; delete c }
      c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
    test_dups*.txt \
  > intersect_dups.txt

# real   0m7.097s
# user   0m6.891s
# sys    0m0.188s

wc -l < intersect_dups.txt # 50389

### Solution #3:

time \
awk ' b == 0 { a[$0] = 0 ; next } 
      $0 in a { a[$0] = 1 } 
      ENDFILE { 
          if (b == 0) { b = 1 } 
          else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } } 
      }
      END { for (i in a) { print i } } ' \
      test_dups*.txt \
  > intersect_dups.txt

# real   0m4.616s
# user   0m4.375s
# sys    0m0.234s

wc -l < intersect_dups.txt # 50389


Upvotes: 3

bsb

Reputation: 1937

Version for multiple files:

eval `perl -le 'print "cat ",join(" | grep -xF -f- ", @ARGV)' t*`

Expands to:

cat t1 | grep -xF -f- t2 | grep -xF -f- t3

Test files:

seq 0 20 | tee t1; seq 0 2 20 | tee t2; seq 0 3 20 | tee t3

Output:

0
6
12
18

Upvotes: 0

jkshah

Reputation: 11713

You could try the comm utility:

comm -12 <t1> <t2>

comm, with an appropriate combination of the following options, can be useful for different set operations on file contents.

   -1     suppress column 1 (lines unique to FILE1)

   -2     suppress column 2 (lines unique to FILE2)

   -3     suppress column 3 (lines that appear in both files)

This assumes <t1> and <t2> are sorted. If not, they should first be sorted with sort.
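
For example, with bash process substitution the sorting can be done on the fly (a sketch):

comm -12 <(sort t1) <(sort t2)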

Upvotes: 22

Adam Liss

Reputation: 48330

Can you use

sort t1 t2 | uniq -d

This will combine the two files, sort them, and then display only the lines that appear more than once: that is, the ones that appear in both files.

This assumes that each file contains no duplicates within it, and that the inodes are the same in all the structures for a particular file.
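
If a file could contain duplicates, one workaround (a sketch, assuming bash process substitution) is to de-duplicate each file first, so that a line can contribute at most one copy per file:

sort <(sort -u t1) <(sort -u t2) | uniq -d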

Upvotes: 25
