Reputation: 2085
So let me explain this a bit more:
I have a directory called tags that has a file for each tag, something like:
tags/
t1
t2
t3
In each of the tag files is a structure like:
<inode> <filename> <filepath>
Of course, each tag file will have a list of many files with that tag (but a file can only appear once in a given tag file). And a file may be in multiple tag files.
What I want to be able to do is call a command like
tags <t1> <t2>
and have it list the files that have BOTH the tags t1 and t2 in a nice way.
My plan right now was to make a temp file: basically dump the entire contents of t1 into it, then run through each line in t2 and do an awk against that file, and just keep doing that for each tag.
But I am wondering if anyone has other ways. I am not overly familiar with awk, grep, etc.
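Roughly, that plan might look something like this (just a sketch of the idea, not something I have working; the temp paths and the awk check on the inode column are guesses):

cp tags/t1 /tmp/result
for tag in t2 t3; do
    : > /tmp/next                          # start an empty file for the narrowed result
    while read -r inode filename filepath; do
        # keep only lines of the running result whose inode appears in this tag file
        awk -v ino="$inode" '$1 == ino' /tmp/result >> /tmp/next
    done < "tags/$tag"
    mv /tmp/next /tmp/result               # the narrowed result becomes the new running result
done
cat /tmp/result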
Upvotes: 18
Views: 12158
Reputation: 95
Using awk, it's quite easy to create a single-command solution that works for an arbitrary number of unsorted files. For large files, it can be much quicker than using sort and pipes, as I show below. By changing $0 to $1 etc., you can also find the intersection of specific columns (for the question's tag files, see the sketch after the explanation of #1 below).
I've included 3 solutions: a simple one that does not handle duplicated lines within files; a more complicated one that does handle them; and an even more complicated one that also handles them and is (over-)engineered for performance. Solutions #1 and #2 assume a version of awk that has the FNR variable, and solution #3 requires gawk's ENDFILE (although this can be circumvented by using FNR == 1 instead and rearranging some logic; a sketch of that variant follows the explanation of #3 below).
Solution #1 (does not handle duplicated lines within files):
awk ' FNR == 1 { b++ } { a[$0]++ } END { for (i in a) { if (a[i] == b) { print i } } } ' \
t1 t2 t3
Solution #2 (handles duplicated lines within files):
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
t1 t2 t3
Solution #3 (performant, handles duplicates within files, but complex, and as written relies on gawk's ENDFILE):
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
t1 t2 t3
Explanation for #1:
FNR == 1 { b++ } # when awk reads the first line of a new file, FNR resets
# to 1. every time FNR == 1, we increment a counter
# variable b.
# this counts the number of input files.
{ a[$0]++ } # on every line in every file, take the whole line ( $0 ),
# use it as a key in the array a, and increase the value
# of a[$0] by 1.
# this counts the number of observations of line $0 across
# all input files.
END { ... } # after reading the last line of the last file...
for (i in a) { ... } # ... loop over the keys of array a ...
if (a[i] == b) { ... } # ... and if the value at that key is equal to the number
# of input files...
print i # ... we print the key - i.e. the line.
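As noted at the top of this answer, changing $0 to $1 restricts the comparison to the first column. For the tag files in the question that is the inode column, so the column-wise variant of Solution #1 might look like this (a sketch: it prints only the shared inodes, and like Solution #1 it assumes no duplicated inodes within a file):

awk ' FNR == 1 { b++ }
      { a[$1]++ }
      END { for (i in a) { if (a[i] == b) { print i } } } ' \
  tags/t1 tags/t2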
Explanation for #2:
c[$0] == 0 { a[$0]++ ; c[$0] = 1 } # as above, but now we include an array c that
# indicates if we've seen lines *within* each file.
# if we haven't seen the line before in this file, we
# increment the count at that line(/key) in array a.
# we also set the value at that key in array c to 1
# to note that we've now seen it in this file before.
FNR == 1 { b++ ; delete c } # as previous solution, but now we also clear the
# array c between files.
Explanation for #3:
This post is already quite long so I won't do a line-by-line for this solution. But in short: 1) we create an array a that includes every line in the first file as a key, with all values set to 0; 2) on subsequent files, if that line is a key in a, we set the value at that key to 1; 3) at the end of each file, we delete all keys in a that have value 0 (indicating we didn't see it in the file that just ended), and reset all remaining values to 0; 4) after all files have been read, print every key that's left in a. We get a good speed-up here because instead of having to keep and search through an array of every single line we've seen so far, we're only keeping an array of lines that are the intersection of all previous files, which (usually!) shrinks with each new file.
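As mentioned above, the dependency on gawk's ENDFILE can be avoided by doing the per-file pruning when the next file starts (FNR == 1) rather than when the current file ends. A rough sketch of that rearrangement (not benchmarked here; note that it silently skips empty input files, because FNR == 1 never fires for them):

awk ' FNR == 1 { nfiles++ }
      # starting the 3rd or later file: prune keys not seen in the file
      # we just finished reading, then reset the surviving marks to 0
      FNR == 1 && nfiles > 2 {
          for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } }
      }
      nfiles == 1 { a[$0] = 0 ; next }   # first file: collect every line as a key
      $0 in a { a[$0] = 1 }              # later files: mark lines we already track
      END { for (i in a) { if (nfiles == 1 || a[i] == 1) { print i } } } ' \
  t1 t2 t3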
Benchmarking:
Note: the improvement in runtime appears to get more significant as the lines within files get longer.
### Create test data with *no duplicated lines within files*
mkdir test_dir; cd test_dir
for i in {1..30}; do shuf -i 1-540000 -n 500000 > test_no_dups${i}.txt; done
### Solution #0: based on sort and uniq
time sort test_no_dups*.txt | uniq -c | sed -n 's/^ *30 //p' > intersect_no_dups.txt
# real 0m12.982s
# user 0m51.594s
# sys 0m3.250s
wc -l < intersect_no_dups.txt # 53772
### Solution #1:
time \
awk ' FNR == 1 { b++ }
{ a[$0]++ }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m8.048s
# user 0m7.484s
# sys 0m0.313s
wc -l < intersect_no_dups.txt # 53772
### Solution #2:
time \
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m14.965s
# user 0m14.688s
# sys 0m0.297s
wc -l < intersect_no_dups.txt # 53772
### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
test_no_dups*.txt \
> intersect_no_dups.txt
# real 0m5.929s
# user 0m5.672s
# sys 0m0.250s
wc -l < intersect_no_dups.txt # 53772
And if files can contain duplicates:
### Create test data containing repeated lines (-r: sample w/ replacement)
for i in {1..30} ; do
shuf -r -i 1-150000 -n 500000 > test_dups${i}.txt
done
### Solution #0: based on sort and uniq
time \
for i in test_dups*.txt ; do
sort -u "$i"
done \
| sort \
| uniq -c \
| sed -n 's/^ *30 //p' \
> intersect_dups.txt
# real 0m13.503s
# user 0m26.688s
# sys 0m2.297s
wc -l < intersect_dups.txt # 50389
### [Solution #1 won't work here]
### Solution #2:
# note: `delete c` can be replaced with `split("", c)`
time \
awk ' FNR == 1 { b++ ; delete c }
c[$0] == 0 { a[$0]++ ; c[$0] = 1 }
END { for (i in a) { if (a[i] == b) { print i } } } ' \
test_dups*.txt \
> intersect_dups.txt
# real 0m7.097s
# user 0m6.891s
# sys 0m0.188s
wc -l < intersect_dups.txt # 50389
### Solution #3:
time \
awk ' b == 0 { a[$0] = 0 ; next }
$0 in a { a[$0] = 1 }
ENDFILE {
if (b == 0) { b = 1 }
else { for (i in a) { if (a[i] == 0) { delete a[i] } else { a[i] = 0 } } }
}
END { for (i in a) { print i } } ' \
test_dups*.txt \
> intersect_dups.txt
# real 0m4.616s
# user 0m4.375s
# sys 0m0.234s
wc -l < intersect_dups.txt # 50389
Upvotes: 3
Reputation: 1937
Version for multiple files:
eval `perl -le 'print "cat ",join(" | grep -xF -f- ", @ARGV)' t*`
Expands to:
cat t1 | grep -xF -f- t2 | grep -xF -f- t3
Test files:
seq 0 20 | tee t1; seq 0 2 20 | tee t2; seq 0 3 20 | tee t3
Output:
0
6
12
18
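For comparison, the same chained-grep idea can be written in plain shell without perl or eval (a sketch; intersect is a hypothetical helper name, not part of the original one-liner):

# Intersect the lines of all files given as arguments by repeatedly
# piping the running result through grep -xF -f- <next file>.
intersect() {
    result=$(cat "$1"); shift
    for f in "$@"; do
        result=$(printf '%s\n' "$result" | grep -xF -f- "$f")
    done
    printf '%s\n' "$result"
}

intersect t1 t2 t3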
Upvotes: 0
Reputation: 11713
You could try the comm utility:
comm -12 <t1> <t2>
comm with an appropriate combination of the following options can be useful for different set operations on file contents:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
This assumes <t1> and <t2> are sorted. If not, they should first be sorted with sort.
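For example, with a shell that supports process substitution (e.g. bash), the sorting can be done inline:

comm -12 <(sort t1) <(sort t2)

and for more than two tags the calls can be nested, since comm -12 emits sorted output:

comm -12 <(comm -12 <(sort t1) <(sort t2)) <(sort t3)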
Upvotes: 22
Reputation: 48330
Can you use
sort t1 t2 | uniq -d
This will combine the two files, sort them, and then display only the lines that appear more than once: that is, the ones that appear in both files.
This assumes that each file contains no duplicates within it, and that the inodes are the same in all the structures for a particular file.
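If a tag file could ever contain internal duplicates, one workaround is to de-duplicate each file first, for example with process substitution (a sketch):

sort <(sort -u t1) <(sort -u t2) | uniq -d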
Upvotes: 25