Reputation: 143
I've searched all over and still cant find this simple answer. I'm sure its so easy. Please help if you know how to accomplish this.
sample.txt is:
AAAAA
I want to find the exact times the combination "AAA" happens. If you just use for example
grep -o 'AAA' sample.txt | wc -l
We receive a 1. This is the same as just searching the number of times AAA happens from with a standard text editor search box type search. However, I want the complete number of matches exactly, starting from each individual character which is exactly 3. We get this when we search from each character individually instead of treating each AAA hit like a box type block.
I am looking for the most squeezed in/most possibilities/literal exact number of occurences starting from every individual character of "AAA" in sample.txt, not just blocks of every time it finds it like it does in a normal text editor type search from the search box.
How do we accomplish this, preferrably in AWK? SED, GREP and anything else is fine as well as I can include in a Bash script.
Upvotes: 3
Views: 261
Reputation: 13998
I posted this on OP's another post, but it was ignored maybe because I did not add notes and explanation. Just a different approach and any discussions are welcome.
$ awk -v sample="$(<sample.txt)" '{ x=sample; n=0 }$0 != ""{
while(t=index(x,$0)){ n++; x=substr(x,t+1) }
print $0,n
}' combinations
Explanation:
The variables:
sample
: is the raw sample text slurp in from the file sample.txt with the -v argumentx
: is the targeting string, before each test, the value is reset to sample
$0
: is the testing string from the file combination
, each line feeds a testing stringn
: is the counter, number of occurences of the testing string($0)t
: is the position of the first character of the matched testing string($0) in the targeting string(x)Update: Added $0 != ""
before the main while loop to skip EMPTY strings which lead to unlimited loop.
The code:
awk -v sample="$(<sample.txt)" '
# reset the targeting string(with the sample text) and the counter "n"
{ x = sample; n = 0 }
# below the main block where $0 != "" to skip the EMPTY testing string
($0 != ""){
# the function index(x, $0) returns the position(assigned to "t") of the first character
# of the matched testing string($0) in the targeting string(x).
# when no match is found, it returns zero and thus step out of the while loop.
while(t=index(x,$0)) {
n++; # increment the number of matches
x = substr(x, t+1) # modify the targeting string to remove all characters before the position(t) inclusively
}
print $0, n # print the testing string and the counts
}
' combinations
awk index() is a function much faster than regex matches and it does not need the expensive string comparisons in a brute-force way. attached the tested sample.txt and combinations:
$ more sample.txt
AAAAAHHHAAHH
HAAAAHHHAAHH
AAHH
$ more combinations
AA
HH
AAA
HHH
AAH
HHA
ZK
Tested Environment: GNU Awk 4.0.2, Centos 7.3
Upvotes: 1
Reputation: 58420
This might work for you (GNU sed & wc):
sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' | wc -l
Lose any characters other than A
's, and single or double A
's.Then print a triple A
and lose the first A
and repeat. Finally count the number of lines printed.
Upvotes: 2
Reputation: 12672
This is the awk version
echo "AAAAA AAA AAAABBAAA" \
| gawk -v pat="AAA" '{
for(i=1; i<=NF; i++){
# current field length
m=length($i)
#search pattern length
n=length(pat)
for(l=1 ; l<m; l++){
sstr=substr($i,l,n)
#print i " " $i " sub:" sstr
# substring matches pattern
if(sstr ~ pat){
count++
}else{
print "contiguous count on field " i " = " count
# uncomment next line if non-contiguous matches are not needed
#break
}
}
print "total count on field " i " = " count
count=0
}
}'
Upvotes: 1
Reputation: 924
This isn't a trivial problem in bash. As far as I know, standard utils don't support this kind of searching. You can however use standard bash features to implement this behavior in a function. Here's how I would attack the problem, but there are other ways:
#!/bin/bash
search_term="AAA"
text=$(cat sample.txt)
term_len=${#search_term}
occurences=0
# While the text is greater than or equal to the search term length
while [ "${#text}" -ge "$term_len" ]; do
# Look at just the length of the search term
text_substr=${text:0:${term_len}}
# If we see the search term, increment occurences
if [ "$text_substr" = "$search_term" ]; then
((occurences++))
fi
# Remove the first character from the main text
# (e.g. "AAAAA" becomes "AAAA")
text=${text:1}
done
printf "%d occurences of %s\n" "$occurences" "$search_term"
Upvotes: 1