Reputation: 356

Bash: match in a list of files, cases including and excluding pattern

I need the cleanest way in bash to pick, out of hundreds of files, those whose contents match some patterns AND do NOT match some others.

For instance:

for transaction in "TXNA" "TXNB" "TXNC" "TXND" "TXNE" ; do
    echo "--> ${transaction}"
    grep -L "EXCLUDE_PATTERN1" $(grep -L "EXCLUDE_PATTERN2" $(grep -Rl --include \*.txt "+${transaction}:" myDir/)) >> myReport.txt
done

so here:

grep -Rl --include \*.txt "+${transaction}:" myDir/

recursively searches myDir for all the .txt files containing +TXNA:, +TXNB:, and so on.

Then

grep -L "EXCLUDE_PATTERN2" $(grep -Rl --include \*.txt "+${transaction}:" myDir/)

excludes from the list found before the files containing EXCLUDE_PATTERN2,

and finally:

grep -L "EXCLUDE_PATTERN1"

excludes from the remaining list the files containing EXCLUDE_PATTERN1.

This is quite ugly, and since I have around 10 patterns to exclude, it will become unreadable.

Any ideas for making this code more readable and easier to debug?

Thanks a lot.

Upvotes: 1

Answers (4)

Kaffe Myers

Reputation: 464

A bit in deep water as I don't have the time to set up testing before I need to get the family some dinner, but could this awk be close to what you are after?

awk -v m=TXNA -v p1=EXCLUDE_PATTERN1 -v p2=EXCLUDE_PATTERN2 '
    $0~m { o[FILENAME] }
    $0~p1 { e1[FILENAME] }
    $0~p2 { e2[FILENAME] }
    END {
        for(v in e1) delete o[v]
        for(v in e2) delete o[v]
        for(v in o) print v
    }
' file*

A parameterized version could look something like this:

#!/bin/bash

unset {in,ex}cludes
[[ $# == 0 ]] && set -- -h

printf -v usage %s "\
$(basename "${0}"): find transactions in files that don't include certain patterns
    -i [PATTERN]  Overwrite/add default included matches.
    -e [PATTERN]  Overwrite/add default excluded matches.
    -h            Print this help section

Add list of files as parameters to define which files to look in.
Example: $0 file*
"

while getopts "i:e:h" o; do
    case "${o}" in
        i) includes+=("${OPTARG}") ;;
        e) excludes+=("${OPTARG}") ;;
        h|*) echo -n "${usage}"; exit   ;;
    esac
done
shift $((OPTIND-1))

# set default values if no includes or excludes have been set on command line
[[ -z ${includes[*]} ]] && includes=( 'TXNA' 'TXNB' 'TXNC' 'TXND' 'TXNE' )
[[ -z ${excludes[*]} ]] && excludes=( 'EXCLUDE_PATTERN1' 'EXCLUDE_PATTERN2' )

ex=$(IFS=\|; echo "${excludes[*]}")
inc=$(printf "%s\n" "${includes[@]}")

gawk -v m="$inc" -v p="$ex" '
    BEGIN {
        RS=""
        split(m,t,"\n")
        for(i in t) {
            m = (i==1) ? t[i] : m "|" t[i]
        }
        m="+(" m "):"
    }
    $0~m && $0!~p {
        for(i in t) {
            if($0~"+"t[i]":") o[t[i]][FILENAME]
        }
    }
    END {
            for(i in o) {
                print i, "matches found in:"
                for(f in o[i]) print "\t" f
            }
    }
' "${@}"

Upvotes: 1

Dudi Boy

Reputation: 4900

A gawk (standard Linux awk) script that scans each file once.

script.awk

BEGIN {
  RS="!@!@!@!@!@!@!@"; # set record separator to something unlikely to match, so each file is read entirely as a single record
  wordsListCount = split(wordsListStr, wordsListArr, " +"); # split wordsListStr on spaces into array wordsListArr, save array length in wordsListCount
  for (i in wordsListArr) wordsListArr[i] = gensub( /(^\")([^\"]*)(\"$)/ , "\\2", "g" , wordsListArr[i]); # strip leading " and trailing "
}
$0 !~ excludePatter1 && $0 !~ excludePatter2 { # for each file (read as single record)
                                               # not matching excludePatter1 and not matching excludePatter2
  for (i in wordsListArr) { # for each word in the list (i is the array index)
    if ($0 ~ wordsListArr[i]) {  # if the file contains the current word
          print FILENAME; # print current file name
          next; # proceed to next file
    }
  }
}

running command:

awk -v wordsListStr='"TXNA" "TXNB" "TXNC" "TXND" "TXNE"' \
    -v excludePatter1='regEx1' \
    -v excludePatter2='regEx2' \
    -f script.awk $(find myDir -type f)

Upvotes: 0

Maciej Wrobel

Reputation: 660

You could use the grep and xargs commands to get your result, e.g.:

grep -lHZR -e 'firstpattern' yourdir |xargs -0 grep -lHZ 'secondpattern' |xargs -0 grep -lHZ 'thirdpattern' |xargs -0 grep -LHZ  firstantipattern ... |xargs -0 grep -LH  lastantipattern

All greps but the first and last take the same switches (-lHZ, or -LHZ for antipatterns). The first also takes -R, to search your directory recursively, and the last omits -Z, so that your final output is not null-terminated.

The -Z option makes grep emit null-terminated file names, so that files with blanks in their names are handled correctly, and -H forces grep to print the file name even when only one file is given.
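With around 10 antipatterns the chain gets long, so here is a rough sketch of the same idea with the exclusions kept in one list. It relies on grep -L accepting several -e patterns at once (a file is listed only if no line matches any of them). The function name list_matching and its argument order are made up for illustration, and it assumes bash plus GNU grep and GNU xargs (for -R, --include, -Z, -0 and -r).

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash, GNU grep and GNU xargs.
# list_matching is a hypothetical helper, not part of any tool.

# list_matching DIR INCLUDE_PATTERN [EXCLUDE_PATTERN...]
# Prints the *.txt files under DIR that contain INCLUDE_PATTERN
# and contain none of the EXCLUDE_PATTERNs.
list_matching() {
    local dir=$1 include=$2
    shift 2

    # With no exclusions, the first grep alone is the answer.
    if (( $# == 0 )); then
        grep -Rl --include='*.txt' -e "$include" "$dir" || true
        return 0
    fi

    # Turn each exclude pattern into its own "-e PATTERN" argument pair.
    local -a exclude_args=()
    local pattern
    for pattern in "$@"; do
        exclude_args+=(-e "$pattern")
    done

    # First grep: NUL-terminated list of files containing the include pattern.
    # Second grep: -L keeps only files where no line matches any -e pattern.
    # xargs -r skips the second grep entirely when the first finds nothing.
    # Exit statuses are ignored here; only the printed list matters.
    grep -RlZ --include='*.txt' -e "$include" "$dir" |
        xargs -0 -r grep -L "${exclude_args[@]}" || true
}
```

It could then be called once per transaction, e.g. list_matching myDir/ "+${transaction}:" "${excludes[@]}" >> myReport.txt, where excludes is a bash array holding the patterns to exclude.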

Upvotes: 0

Dan Gerrity

Reputation: 46

I'm not sure I fully understand your question; however, seeing pattern matching while searching for files definitely suggests the use of find.

for transaction in "TXNA" "TXNB" "TXNC" "TXND" "TXNE" ; do 
    find ./myDir -name "${yes_pattern}" ! -name "${no_pattern}" -print >> myReport.txt
done

Find is a sophisticated tool designed to do what you want -- use man find to see additional options, including the -exec switch.
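Note that -name tests file names, not file contents. If the patterns are meant to match contents, as in the question, one possible sketch uses -exec with grep -q as the test. The function name find_by_content is hypothetical, and since grep is started once or twice per file this may be slow over many files.

```shell
#!/usr/bin/env bash
# Sketch only: find_by_content is a hypothetical helper name.

# find_by_content DIR INCLUDE_PATTERN EXCLUDE_PATTERN
# Prints the *.txt files under DIR whose contents match INCLUDE_PATTERN
# and do not match EXCLUDE_PATTERN.
find_by_content() {
    local dir=$1 include=$2 exclude=$3

    # -exec ... \; is true when the command exits 0, and grep -q exits 0
    # on a match, so the first -exec keeps files containing the include
    # pattern and the negated one drops files containing the exclude pattern.
    find "$dir" -type f -name '*.txt' \
        -exec grep -q -e "$include" {} \; \
        ! -exec grep -q -e "$exclude" {} \; \
        -print
}
```

For example: find_by_content ./myDir "+${transaction}:" "EXCLUDE_PATTERN1" >> myReport.txt. More exclusions would each need another ! -exec grep -q ... clause.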

Upvotes: 1
