user2044638

Reputation: 543

Check if all lines from one file are present somewhere in another file

I used file1 as a source of data for file2 and now I need to make sure that every single line of text from file1 occurs somewhere in file2 (and find out which lines are missing, if any). It's probably important to note that while file1 has conveniently one search term per line, the terms can occur anywhere in the file2 including in the middle of a word. Also would help if the matching was case insensitive - doesn't matter if the text in file2 is even in all caps as long as it's there.

The lines in file1 include spaces and all sorts of other special characters like --.

Upvotes: 15

Views: 5788

Answers (3)

David Anderson

Reputation: 58

The highest voted answer to this post -- grep -Fqvf file2 file1 -- is not quite correct; there are a number of issues with it, all stemming from a single major issue: namely, that the direction of comparison is flipped. We are using every line in file2 to search file1 to make sure that all lines in file1 are covered. This fits with how grep works and is elegant, but it doesn't actually solve the problem. I discovered this while comparing two package lists -- one the output of pacman -Qqe, the other a list I'd made compiling those packages into different groupings to simplify setting up a new computer. I wanted to make sure that I hadn't missed any packages in my groupings.

The first problem is major -- if file2 contains a single empty line, the command will never report a missing line. This is because the empty line in file2 acts as an empty pattern, which matches every line of file1, so -v has nothing left to select. So with the following files, we fail to identify that zsh is missing from file2:

file1                        file2

acpi                         acpi
...                          ...
r                            r
...                          ...
yaourt                       yaourt
zsh                          
<EOF>                        <EOF>

$ grep -Fvf file2 file1
[ no output ]

Ok, so we can just strip empty lines, right?

$ grep -Fv "$(grep -ve '^$' file2)" file1
zsh

Great! But now we get to another problem. Let's say we remove yaourt from file2. We'd expect the output to now be

yaourt
zsh

But here's what we actually get

$ grep -Fv "$(grep -ve '^$' file2)" file1
zsh

Why is that? Well, it's the same reason that an empty line causes problems. In this case, the line r in file2 is matching yaourt in file1. Removing empty lines only fixed the most egregious case of this more general problem.
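To reproduce both false negatives in isolation, here is a minimal sketch. The four-line file1 and the two file2 variants are made-up stand-ins for the package lists above, and the temporary directory is only there to keep the example self-contained. It assumes a grep that treats an empty line in a -f file as an empty pattern, as described above.

```shell
# Hypothetical miniature package lists; not the real files from this answer.
tmp=$(mktemp -d)
printf '%s\n' acpi r yaourt zsh > "$tmp/file1"

# Case 1: file2 lacks zsh but contains an empty line.
printf '%s\n' acpi r yaourt '' > "$tmp/file2_blank"
with_blank=$(grep -Fvf "$tmp/file2_blank" "$tmp/file1" || true)

# Case 2: no empty line, but yaourt and zsh are both missing from file2.
printf '%s\n' acpi r > "$tmp/file2_short"
without_yaourt=$(grep -Fvf "$tmp/file2_short" "$tmp/file1" || true)

echo "with blank line: [$with_blank]"
echo "without yaourt:  [$without_yaourt]"
rm -rf "$tmp"
```

In the first case nothing is reported at all; in the second only zsh is reported, because the pattern r hides yaourt.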

Apart from the false negatives here, there are also false positives from not handling the case OP called out --

It's probably important to note that while file1 has conveniently one search term per line, the terms can occur anywhere in the file2 including in the middle of a word.

So this would mean that if ohmyzsh is in file2, that should count as a match for zsh in file1. But that is not what happens: we are searching file1 for ohmyzsh, and the line zsh cannot contain ohmyzsh, since it is only a substring of it. So zsh gets reported as missing even though it does occur in file2. This last example illustrates why searching file1 with the lines of file2 categorically will not work. But if we search file2 with the lines of file1 in a single grep, we will get all the matching lines in file2, but not know whether we have a match for every line of file1. The number of matches doesn't help, since we could have multiple matches for, say, sh (zsh, bash, fish, ...) but no matches for acpi.
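The false positive is just as easy to reproduce. Again a sketch with made-up two-line files: the reversed command reports zsh, while searching file2 for the single term zsh finds it inside ohmyzsh.

```shell
# Hypothetical inputs: zsh only occurs in file2 as part of a longer word.
tmp=$(mktemp -d)
printf '%s\n' acpi zsh > "$tmp/file1"
printf '%s\n' acpi ohmyzsh > "$tmp/file2"

# Reversed direction: lines of file2 are the patterns, file1 is searched.
reversed=$(grep -Fvf "$tmp/file2" "$tmp/file1" || true)

# Forward direction, one term at a time: zsh is found as a substring.
if grep -Fq -e zsh "$tmp/file2"; then forward=found; else forward=missing; fi

echo "reversed reports as missing: $reversed"
echo "forward search says zsh is:  $forward"
rm -rf "$tmp"
```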

This is all a very long way of saying that this isn't a problem that can be solved with O(1) greps. You'd need to use a loop. With a loop, the problem is trivial.

readarray -t terms < file1 # bash
# zsh: terms=("${(@f)$(< file1)}")

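for term in "${terms[@]}"; do # I know `do` "should" be on a separate line; bite me
  # -i: ignore case; --: end of options, since a term may start with a dash
  grep -Fqi -- "$term" file2 ||
    echo "$term does not appear in file2" # no break, so every missing term is listed
done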
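If you need this more than once, the same loop can be wrapped in a function. This is only a sketch: the name missing_terms and the variables term and rc are mine, it reads the terms with a plain while/read loop so that it does not depend on readarray, and it skips empty lines in the terms file on the assumption that they are not meant as search terms.

```shell
# missing_terms TERMS_FILE SEARCH_FILE
# Prints every non-empty line of TERMS_FILE that occurs nowhere in SEARCH_FILE
# (fixed-string, case-insensitive). Returns 1 if anything was printed, else 0.
# Note: term and rc are ordinary global variables here.
missing_terms() {
    rc=0
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while IFS= read -r term || [ -n "$term" ]; do
        [ -n "$term" ] || continue
        # -e keeps a term that starts with a dash from being parsed as an option.
        if ! grep -Fqi -e "$term" -- "$2"; then
            printf '%s\n' "$term"
            rc=1
        fi
    done < "$1"
    return "$rc"
}
```

Call it as missing_terms file1 file2; the exit status makes it usable in an if as well.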

Upvotes: 3

Dmitry Alexandrov

Reputation: 1773

if grep -Fqvf file2 file1; then
    echo $"There are lines in file1 that don’t occur in file2."
fi

Grep options mean:

-F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
-f, --file=FILE           obtain PATTERN from FILE
-v, --invert-match        select non-matching lines
-q, --quiet, --silent     suppress all normal output

Upvotes: 17

Håkon Hægland

Reputation: 40738

You can try

awk -f a.awk file1 file2

where a.awk is

BEGIN { IGNORECASE=1 }
NR==FNR {
    a[$0]++
    next
}
{
    for (i in a) 
        if (index($0,i)) 
            delete a[i]
}

END {
    for (i in a)
        print i
}
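Note that IGNORECASE is a gawk extension; in other awk implementations it is just an ordinary variable and the comparison stays case-sensitive. A sketch of a variant that should work with any POSIX awk: lower-case both sides with tolower() before calling index(). The shell wrapper missing_terms_awk and the array name terms are only illustrative names.

```shell
# missing_terms_awk TERMS_FILE SEARCH_FILE
# Prints the lines of TERMS_FILE that occur nowhere in SEARCH_FILE,
# ignoring case. Output order is unspecified (awk's for-in order).
missing_terms_awk() {
    awk '
        # First file: remember each non-empty term, keyed by its lower-case form.
        NR == FNR { if ($0 != "") terms[tolower($0)] = $0; next }
        # Second file: drop every term that occurs in the current line.
        {
            line = tolower($0)
            for (t in terms)
                if (index(line, t))
                    delete terms[t]
        }
        # Whatever is left was never found; print it in its original spelling.
        END { for (t in terms) print terms[t] }
    ' "$1" "$2"
}
```

Usage is missing_terms_awk file1 file2. Unlike the grep loop, this reads file2 only once, which may matter if file1 is long.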

Upvotes: 4
