Reputation: 543
I used file1 as a source of data for file2, and now I need to make sure that every single line of text from file1 occurs somewhere in file2 (and find out which lines are missing, if any). It's probably important to note that while file1 conveniently has one search term per line, the terms can occur anywhere in file2, including in the middle of a word. It would also help if the matching were case-insensitive: it doesn't matter if the text in file2 is in all caps, as long as it's there.
The lines in file1 include spaces and all sorts of other special characters, like --.
Upvotes: 15
Views: 5788
Reputation: 58
The highest voted answer to this post, grep -Fqvf file2 file1, is not quite correct. It has a number of issues, all stemming from one major problem: the direction of comparison is flipped. It uses every line in file2 to search file1, when what we need is to check that every line in file1 is covered by file2. This fits with how grep works and is elegant, but it doesn't actually solve the problem. I discovered this while comparing two package lists: one the output of pacman -Qqe, the other a list I'd made compiling those packages into different groupings to simplify setting up a new computer. I wanted to make sure that I hadn't missed any packages in my groupings.
The first problem is major: if file2 contains a single empty line, the command never prints anything (i.e., it will not identify that there are missing lines). This is because the empty line in file2 matches every line of file1. So with the following files, we do not correctly identify that zsh is missing from file2:
file1     file2
acpi      acpi
...       ...
r         r
...       ...
yaourt    yaourt
zsh       (empty line)
<EOF>     <EOF>
$ grep -Fvf file2 file1
[ no output ]
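The effect is easy to reproduce in isolation. Here is a minimal sketch, assuming GNU grep and a scratch directory from mktemp; the file contents are invented stand-ins, not the original package lists:

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the two package lists.
tmp=$(mktemp -d)
printf '%s\n' acpi r yaourt zsh > "$tmp/file1"
printf '%s\n' acpi r yaourt ''  > "$tmp/file2"   # note the trailing empty line

# With -F, an empty pattern matches every line, so -v has nothing left to print.
with_empty=$(grep -Fvf "$tmp/file2" "$tmp/file1")

# Same comparison after dropping empty lines from the pattern list.
without_empty=$(grep -Fv "$(grep -v '^$' "$tmp/file2")" "$tmp/file1")

printf 'with empty line:    [%s]\n' "$with_empty"
printf 'without empty line: [%s]\n' "$without_empty"
rm -rf "$tmp"
```

The first result is empty even though zsh is missing from file2; the second one reports zsh.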
Ok, so we can just strip empty lines, right?
$ grep -Fv "$(grep -ve '^$' file2)" file1
zsh
Great! But now we get to another problem. Let's say we remove yaourt from file2. We'd expect the output to now be:
yaourt
zsh
But here's what we actually get:
$ grep -Fv "$(grep -ve '^$' file2)" file1
zsh
Why is that? Well, it's the same reason that an empty line causes problems. In this case, the line r in file2 matches yaourt in file1, because yaourt contains the letter r. Removing empty lines only fixed the most egregious case of this more general problem.
Apart from the false negatives here, there are also false positives from not handling the case OP called out:
"It's probably important to note that while file1 has conveniently one search term per line, the terms can occur anywhere in the file2 including in the middle of a word."
This means that if ohmyzsh is in file2, that should count as a match for zsh in file1. But it won't, since we are searching file1 for ohmyzsh, and the line zsh obviously doesn't contain it; the substring relationship runs the other way. So zsh is wrongly reported as missing. This last example illustrates why searching file1 with the lines of file2 categorically will not work. But if we search file2 with the lines of file1, we get all the matches in file2 without knowing whether we have a match for every line of file1. The number of matches doesn't help, since we could have multiple matches for, say, sh (zsh, bash, fish, ...) but no matches for acpi.
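To make the direction issue concrete, here is a small sketch with invented file contents (the sample lines are illustrative, not taken from the original lists). It runs the comparison both ways and shows that neither single grep answers the question:

```shell
#!/usr/bin/env bash
# Hypothetical sample data.
tmp=$(mktemp -d)
printf '%s\n' zsh acpi          > "$tmp/file1"   # terms we need to find
printf '%s\n' ohmyzsh bash fish > "$tmp/file2"   # text we search in

# Flipped direction: file2's lines are the patterns, so zsh is reported as
# missing even though it occurs inside ohmyzsh.
flipped=$(grep -Fvf "$tmp/file2" "$tmp/file1")

# Right direction, single grep: we learn which lines of file2 matched,
# but not which terms of file1 were never found.
matched=$(grep -Ff "$tmp/file1" "$tmp/file2")

printf 'flipped reports as missing:\n%s\n' "$flipped"
printf 'file2 lines that matched:\n%s\n' "$matched"
rm -rf "$tmp"
```

The flipped run lists both zsh and acpi as missing, although only acpi really is. The second run prints ohmyzsh, which says nothing about acpi.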
This is all a very long way of saying that this isn't a problem that can be solved with a fixed number of grep calls. You need a loop, and with a loop the problem is trivial:
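readarray -t terms < file1 # bash 4+
# zsh: terms=("${(@f)$(< file1)}")
for term in "${terms[@]}"; do # I know `do` "should" be on a separate line; bite me
    # -F fixed string, -q quiet, -i case-insensitive (as the question asks);
    # -- ends option parsing, so terms that start with - are safe
    grep -Fqi -- "$term" file2 ||
        echo "$term does not appear in file2" # no break: list every missing term
done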
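The same idea can be wrapped in a function that avoids readarray, so it also runs in older bash. This is only a sketch: missing_terms is a hypothetical helper name, and the sample files are invented for the demonstration.

```shell
#!/usr/bin/env bash
# missing_terms TERMS_FILE TEXT_FILE   (hypothetical helper, not a standard tool)
# Prints every line of TERMS_FILE that occurs nowhere in TEXT_FILE,
# ignoring case. Returns 1 if anything is missing, 0 otherwise.
missing_terms() {
    local terms_file=$1 text_file=$2 term status=0
    # The || [ -n "$term" ] part keeps a last line without a trailing newline.
    while IFS= read -r term || [ -n "$term" ]; do
        [ -n "$term" ] || continue   # an empty term would match everything
        # -e passes the term as a pattern even when it starts with a dash.
        if ! grep -Fqi -e "$term" -- "$text_file"; then
            printf '%s\n' "$term"
            status=1
        fi
    done < "$terms_file"
    return "$status"
}

# Invented sample data: spaces, leading dashes, and mixed case.
tmp=$(mktemp -d)
printf '%s\n' 'zsh' '--noconfirm' 'base devel' > "$tmp/file1"
printf '%s\n' 'OHMYZSH' 'pacman -S --NoConfirm' > "$tmp/file2"
missing=$(missing_terms "$tmp/file1" "$tmp/file2")
printf 'missing: %s\n' "$missing"
rm -rf "$tmp"
```

This runs one grep per term, which is fine for lists of a few hundred lines but gets slow for very large ones.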
Upvotes: 3
Reputation: 1773
if grep -Fqvf file2 file1; then
echo $"There are lines in file1 that don’t occur in file2."
fi
Grep options mean:
-F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
-f, --file=FILE           obtain PATTERN from FILE
-v, --invert-match        select non-matching lines
-q, --quiet, --silent     suppress all normal output
Upvotes: 17
Reputation: 40738
You can try
awk -f a.awk file1 file2
where a.awk is:
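BEGIN { IGNORECASE=1 }    # gawk extension: makes the matching case-insensitive
# First file (file1): remember every line as a search term
NR==FNR {
    a[$0]++
    next
}
# Second file (file2): drop every term that occurs in the current line
{
    for (i in a)
        if (index($0,i))
            delete a[i]
}
# Whatever is left never appeared anywhere in file2
END {
    for (i in a)
        print i
}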
Upvotes: 4