Reputation: 356
I need to find the cleanest way in bash to extract, from hundreds of files, the files whose contents match some patterns AND do NOT match some others.
For instance:
for transaction in "TXNA" "TXNB" "TXNC" "TXND" "TXNE" ; do
echo "--> ${transaction}"
grep -L "EXCLUDE_PATTERN1" $(grep -lL "EXCLUDE_PATTERN2" $(grep -Rl --include \*.txt "+${transaction}:" myDir/)) >> myReport.txt
done
so here:
grep -Rl --include \*.txt "+${transaction}:" myDir/
This greps myDir recursively for all the .txt files matching TXNA, TXNB, and so on.
Then
$(grep -lL "EXCLUDE_PATTERN2" $(grep -Rl --include \*.txt "+${transaction}:" myDir/))
This excludes from the list found above the files containing EXCLUDE_PATTERN2.
and finally:
grep -L "EXCLUDE_PATTERN1"
This excludes from the remaining list the files containing EXCLUDE_PATTERN1.
This is quite ugly: I have around 10 patterns to exclude, so it will become unreadable.
Any ideas for making this code more readable and easier to debug?
Thanks a lot.
Upvotes: 1
Views: 149
Reputation: 464
I'm in slightly deep water here, as I don't have time to set up tests before I need to get the family some dinner, but could this awk be close to what you are after?
awk -v m=TXNA -v p1=EXCLUDE_PATTERN1 -v p2=EXCLUDE_PATTERN2 '
$0~m { o[FILENAME] }
$0~p1 { e1[FILENAME] }
$0~p2 { e2[FILENAME] }
END {
for(v in e1) delete o[v]
for(v in e2) delete o[v]
for(v in o) print v
}
' file*
A parameterized version could look something like this:
#!/bin/bash
unset {in,ex}cludes
[[ $# == 0 ]] && set -- -h
printf -v usage %s "\
$(basename "${0}"): find transactions in files that don't include certain patterns
-i [PATTERN] Overwrite/add default included matches.
-e [PATTERN] Overwrite/add default excluded matches.
-h Print this help section
Add list of files as parameters to define which files to look in.
Example: $0 file*
"
while getopts "i:e:h" o; do
case "${o}" in
i) includes+=("${OPTARG}") ;;
e) excludes+=("${OPTARG}") ;;
h|*) echo -n "${usage}"; exit ;;
esac
done
shift $((OPTIND-1))
# set default values if no includes or excludes have been set on command line
[[ -z ${includes[*]} ]] && includes=( 'TXNA' 'TXNB' 'TXNC' 'TXND' 'TXNE' )
[[ -z ${excludes[*]} ]] && excludes=( 'EXCLUDE_PATTERN1' 'EXCLUDE_PATTERN2' )
ex=$(IFS=\|; echo "${excludes[*]}")
inc=$(printf "%s\n" "${includes[@]}")
gawk -v m="$inc" -v p="$ex" '
BEGIN {
RS=""
n=split(m,t,"\n")
# loop by numeric index: the order of for(i in t) is unspecified in awk
for(i=1;i<=n;i++) {
m = (i==1) ? t[i] : m "|" t[i]
}
m="+(" m "):"
}
$0~m && $0!~p {
for(i in t) {
if($0~"+"t[i]":") o[t[i]][FILENAME]
}
}
END {
for(i in o) {
print i, "matches found in:"
for(f in o[i]) print "\t" f
}
}
' "${@}"
Upvotes: 1
Reputation: 4900
A gawk (standard Linux awk) script that scans each file once.
BEGIN {
RS="!@!@!@!@!@!@!@"; # set record separator to something unlikely to match, so each file is read entirely as a single record
wordsListCount = split(wordsListStr, wordsListArr, " +"); # split wordsListStr on spaces into array wordsListArr, saving the array length in wordsListCount
for (i in wordsListArr) wordsListArr[i] = gensub( /(^\")([^\"]*)(\"$)/ , "\\2", "g" , wordsListArr[i]); # clear prefix " and suffix "
}
$0 !~ excludePatter1 && $0 !~ excludePatter2 { # for each file (read as single record)
# not matching excludePatter1 and not matching excludePatter2
for (i in wordsListArr) { # for each word (i is the array index, not the word itself)
if ($0 ~ wordsListArr[i]) { # if found a match to current word in file
print FILENAME; # print current file name
next; # proceed to next file
}
}
}
awk -v wordsListStr='"TXNA" "TXNB" "TXNC" "TXND" "TXNE"' \
-v excludePatter1='regEx1' \
-v excludePatter2='regEx2' \
-f script.awk $(find myDir -type f)
Upvotes: 0
Reputation: 660
You could use the grep and xargs commands to get your result, i.e.:
grep -lHZR -e 'firstpattern' yourdir |xargs -0 grep -lHZ 'secondpattern' |xargs -0 grep -lHZ 'thirdpattern' |xargs -0 grep -LHZ firstantipattern ... |xargs -0 grep -LH lastantipattern
All but the first and last grep have the same switches (-lHZ, or -LHZ for antipatterns). The first one also has -R, to search your directory recursively, and the last one omits -Z, so that your final output is not null-terminated.
The -Z option makes grep terminate each file name with a null byte, which allows working with files containing blanks in their names, and -H forces grep to print the file name even if only one file is found.
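With around ten antipatterns, the chain above can be built in a loop instead of being typed out. The following is only a sketch, assuming GNU grep and GNU xargs (for -Z, -0 and -r); the function name filter_files and its argument order are hypothetical, and the include string is matched literally with -F so that a leading "+" needs no escaping:

```shell
#!/bin/bash
# Hypothetical helper: print the .txt files under DIR that contain the fixed
# string INCLUDE and match none of the EXCLUDE regular expressions.
# Usage: filter_files DIR INCLUDE [EXCLUDE...]
filter_files() {
    local dir=$1 include=$2
    shift 2
    local list next pattern
    list=$(mktemp) || return 1
    next=$(mktemp) || return 1

    # Step 1: null-terminated list of files containing the include string.
    grep -RlZF --include='*.txt' -e "$include" "$dir" > "$list"

    # Step 2: one pass per exclude pattern; -L keeps only files WITHOUT a match.
    for pattern in "$@"; do
        # -r (GNU xargs) skips grep entirely when the list is already empty.
        xargs -0 -r grep -LZ -e "$pattern" < "$list" > "$next"
        mv "$next" "$list"
    done

    # Step 3: turn the null-terminated list into one file name per line.
    tr '\0' '\n' < "$list"
    rm -f "$list" "$next"
}
```

Called as filter_files myDir "+TXNA:" EXCLUDE_PATTERN1 EXCLUDE_PATTERN2, it should list the same files as the nested command in the question, and adding an eleventh antipattern is just one more argument. The -r flag matters: without it, an empty intermediate list would make grep read standard input instead of files.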
Upvotes: 0
Reputation: 46
I'm not sure I fully understand your question; however, seeing pattern matching while searching for files definitely suggests the use of find.
for transaction in "TXNA" "TXNB" "TXNC" "TXND" "TXNE" ; do
find ./myDir -name "${yes_pattern}" ! -name "${no_pattern}" -print >> myReport.txt
done
find is a sophisticated tool designed to do what you want. Use man find to see additional options, including the -exec switch.
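Note that -name tests file names, not file contents. To filter on contents, each test can be delegated to grep through -exec, which find treats as true when the command exits with status 0. A rough sketch under that assumption; find_matching is a hypothetical name, and the include string is matched literally with -F:

```shell
#!/bin/bash
# Hypothetical helper: print the .txt files under DIR that contain the fixed
# string INCLUDE and match none of the EXCLUDE regular expressions.
# Usage: find_matching DIR INCLUDE [EXCLUDE...]
find_matching() {
    local dir=$1 include=$2
    shift 2
    # Base expression: regular .txt files whose contents contain INCLUDE.
    local -a expr=( -type f -name '*.txt' -exec grep -qF -e "$include" {} \; )
    local pattern
    for pattern in "$@"; do
        # "!" negates the test: keep the file only if grep finds no match.
        expr+=( '!' -exec grep -q -e "$pattern" {} \; )
    done
    find "$dir" "${expr[@]}" -print
}

# Possible use with the loop from the question:
# for transaction in TXNA TXNB TXNC TXND TXNE; do
#     echo "--> ${transaction}"
#     find_matching myDir "+${transaction}:" EXCLUDE_PATTERN1 EXCLUDE_PATTERN2
# done >> myReport.txt
```

This starts one grep process per file and per test, so it is slower than the grep-only pipelines on large trees, but each exclusion stays a separate, easily debugged argument.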
Upvotes: 1