Michael

Reputation: 5335

find and grep: get filenames

I need to find the reports (.docx files), read them with docx2txt, find the second match of "passed" (excluding "not passed"), and save those filenames to a text file. Here is what I tried:

OIFS="$IFS"
IFS=$'\n'
for f in $(find . -wholename '*_done/(*Report*.docx' |grep -v appendix)
do
    docx2txt "$f" - | (grep -q -m2 passed || grep -q -v "not passed") || echo $f >> failed
done
IFS="$OIFS"

But this script gives me an empty file. If I replace || with && before echo, all filenames are stored in the file. grep works fine outside the script, as does docx2txt. What am I doing wrong here?

Upvotes: 0

Views: 211

Answers (2)

Socowi

Reputation: 27205

There are quite a lot of problems with the grep commands.

grep -q always exits successfully on the first match.

  • With -q the -m2 has no effect. If there is one match grep exits successfully. It does not check if there is a second match.
    To check that there are (at least) two matches, count the matches and then use test/[ ] to check the number of found matches. If there is at most one passed per line, grep -c is sufficient. If there can be multiple matches per line, you need grep -o ... | wc -l.

  • -q and -v together mean: Is there at least one line that does not contain the pattern? When grep finds such a line it exits successfully. The only way for this command to fail is an input in which every line contains not passed (this includes the empty file).
    Matching passed but not not passed is trickier than one might suspect. If there can be at most one passed/not passed per line, you can use grep -v 'not passed' | grep passed. Otherwise you need a negative lookbehind, which is only available in Perl-compatible regular expressions (PCRE).
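
To make the difference between counting lines and counting matches concrete, here is a minimal sketch. The sample text and the function names count_lines and count_matches are made up for illustration, and the second function assumes GNU grep, since -P is not available everywhere.

```shell
#!/bin/sh
# Hypothetical sample text standing in for the output of docx2txt.
sample='test A passed, test B passed
test C not passed
test D passed'

# grep -c counts matching LINES: only lines 1 and 3 survive the -v filter.
count_lines() { grep -v 'not passed' | grep -c 'passed'; }

# grep -o prints each match on its own line, so wc -l counts MATCHES.
# -P (PCRE) is a GNU grep extension; (?<!not ) is a negative lookbehind.
count_matches() { grep -Po '(?<!not )passed' | wc -l; }

printf '%s\n' "$sample" | count_lines     # prints 2
printf '%s\n' "$sample" | count_matches   # prints 3
```

The first line of the sample holds two matches, which is why the two counts differ.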

In addition to that, command | (grep ... || grep ...) might not do what you expect. command produces its output only once. Once the first grep has read part of that output, that part is gone. The second grep then continues reading where the first grep stopped.
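
A small experiment shows the effect. This is only a sketch with placeholder words as input; the variable names shared and separate are chosen for the example.

```shell
#!/bin/sh
# Hypothetical two-line input; the words are placeholders.
# The first grep finds no match, so it reads the whole pipe before failing.
# The second grep then starts at end-of-input and counts zero lines.
# grep -c still exits non-zero on a count of 0, hence the "|| true".
shared=$(printf 'alpha\nbeta\n' | (grep -q 'gamma' || grep -c 'beta')) || true
echo "one shared pipe: $shared"      # prints 0

# Giving each grep its own copy of the text lets both scan all of it.
text=$(printf 'alpha\nbeta\n')
separate=$(printf '%s\n' "$text" | grep -q 'gamma' || printf '%s\n' "$text" | grep -c 'beta')
echo "separate inputs: $separate"    # prints 1
```

The line containing beta is in the input both times, but the second grep only sees it when it gets its own copy.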

BTW: for … in $(find … | grep -v …) can be turned into a single, safe find command using -not and -exec.

Solution

If each line contains at most one passed/not passed, use

find . -wholename '*_done/(*Report*.docx' -not -wholename '*appendix*' \
-exec sh -c '[ $(docx2txt "$0" - | grep -v "not passed" | grep -cm2 passed) = 2 ]' {} \; -print

If there can be multiple passed/not passed per line, you need GNU grep or pcregrep:

find . -wholename '*_done/(*Report*.docx' -not -wholename '*appendix*' \
-exec sh -c '[ $(docx2txt "$0" - | grep -Pom2 "(?<!not )passed" | wc -l) = 2 ]' {} \; -print

Upvotes: 2

larsks

Reputation: 311506

When you run into a problem like this, it's a good idea to remove as much code as possible. If we just take that one line with the multiple grep statements, we can first verify that the current expression doesn't work:

$ echo passed | (grep -q -m2 passed || grep -q -v "not passed") || echo failed
$ echo not passed | (grep -q -m2 passed || grep -q -v "not passed") || echo failed

We can see that neither of these commands produces any output.

Let's think carefully about the logic:

The || operator means "if the first command doesn't succeed, run the second command". So in both cases, the first grep succeeds (because both passed and not passed contain the phrase passed). This means the second grep will never run, and it means that since the first command was successful, the entire grep ... || grep ... command will be successful, and that means the final echo $f will never run.
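
If you want to check that reasoning yourself, you can run each grep on its own and look at the exit status. The helper name statuses below is made up for this sketch.

```shell
#!/bin/sh
# Print the exit status of each grep when it is run alone on the same text.
statuses() {
    first=0
    second=0
    echo "$1" | grep -q -m2 passed || first=$?
    echo "$1" | grep -q -v 'not passed' || second=$?
    echo "$first $second"
}

statuses 'passed'       # prints: 0 0
statuses 'not passed'   # prints: 0 1
```

The first grep returns 0 for both inputs, so in grep ... || grep ... the right-hand side is skipped and the final echo never fires.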


I was trying to think of a clever way to solve this, but it seems simplest if we make use of a temporary file:

OIFS="$IFS"
IFS=$'\n'
tmpfile=$(mktemp docXXXXXX)
trap "rm -f $tmpfile" EXIT
for f in $(find . -wholename '*_done/(*Report*.docx' |grep -v appendix)
do
    docx2txt "$f" - | head -2 > "$tmpfile"
    if grep -q passed "$tmpfile" && ! grep -q 'not passed' "$tmpfile"; then
      echo "$f" >> failed
    fi
done
IFS="$OIFS"

Upvotes: 2
