Reputation: 143

BASH: Search a string and display the exact number of times a substring occurs in it

I've searched all over and still can't find this simple answer. I'm sure it's easy. Please help if you know how to accomplish this.

sample.txt is:

AAAAA

I want to find the exact number of times the combination "AAA" occurs. If you just use, for example,

grep -o 'AAA' sample.txt | wc -l

We get 1. This is the same result a standard text editor's search box would give for AAA. However, I want the complete number of matches, starting from each individual character, which is exactly 3. We get this when we search from each character individually instead of treating each AAA hit as a separate block.

I am looking for the maximum number of occurrences of "AAA" in sample.txt, counting overlapping matches that start at every individual character, not just the non-overlapping blocks that a normal text editor search finds.

How do we accomplish this, preferably in AWK? sed, grep, and anything else is fine as well, as long as I can include it in a Bash script.

Upvotes: 3

Answers (4)

jxc

Reputation: 13998

I posted this on another of the OP's posts, but it was ignored, maybe because I did not add notes and an explanation. It is just a different approach, and any discussion is welcome.

$ awk -v sample="$(<sample.txt)" '{ x=sample; n=0 }$0 != ""{
    while(t=index(x,$0)){ n++; x=substr(x,t+1) } 
    print $0,n
}' combinations

Explanation:

The variables:

sample: the raw sample text, slurped in from the file sample.txt with the -v argument
x: the target string; before each test, its value is reset to sample
$0: the test string from the file combinations; each line supplies one test string
n: the counter, i.e. the number of occurrences of the test string ($0)
t: the position of the first character of the matched test string ($0) in the target string (x)

Update: Added $0 != "" before the main while loop to skip EMPTY strings, which lead to an infinite loop.

The code:

    awk -v sample="$(<sample.txt)"   '

        # reset the target string (to the sample text) and the counter "n"
        { x = sample; n = 0 }  

        # the main block runs only where $0 != "", to skip an EMPTY test string
        ($0 != ""){
            # index(x, $0) returns the position (assigned to "t") of the first character
            # of the matched test string ($0) in the target string (x).
            # when no match is found, it returns zero and the while loop ends.
            while(t=index(x,$0)) {
                n++;                # increment the number of matches 
                x = substr(x, t+1)  # drop everything up to and including position t from the target string
            }
            print $0, n             # print the test string and its count
        }
    ' combinations

awk's index() function is typically much faster than regex matching, and this approach avoids comparing a substring at every position in a brute-force way. Attached are the tested sample.txt and combinations files:

$ more sample.txt 
AAAAAHHHAAHH
HAAAAHHHAAHH
AAHH

$ more combinations 
AA
HH
AAA
HHH
AAH
HHA
ZK

Tested Environment: GNU Awk 4.0.2, Centos 7.3
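
For a single search string, the index()/substr() loop above can also be wrapped in a small shell function. This is only a sketch: the name count_overlaps is made up for illustration, and passing the strings through ENVIRON instead of -v is a choice made here (awk interprets backslash escapes in -v assignments), not part of the original answer.

```shell
#!/bin/bash
# count_overlaps NEEDLE HAYSTACK   (hypothetical helper name)
# Prints how many times NEEDLE occurs in HAYSTACK, counting overlapping matches.
count_overlaps() {
    # Hand both strings to awk via the environment so that backslashes
    # in the text are not interpreted as escape sequences.
    needle=$1 haystack=$2 awk 'BEGIN {
        needle = ENVIRON["needle"]
        x = ENVIRON["haystack"]
        n = 0
        if (needle != "") {                 # an empty needle would loop forever
            while ((t = index(x, needle)) > 0) {
                n++
                x = substr(x, t + 1)        # resume one character past the match start
            }
        }
        print n
    }'
}

count_overlaps AAA AAAAA           # prints 3
count_overlaps AA AAAAAHHHAAHH     # prints 5
```

Because the search restarts one character after each match start rather than after the match end, overlapping hits are counted.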

Upvotes: 1

potong

Reputation: 58420

This might work for you (GNU sed & wc):

sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' sample.txt | wc -l

Lose any characters other than A's, as well as single or double A's. Then print a triple A, lose the first A, and repeat. Finally, count the number of lines printed.
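
To try the steps described above on a couple of inputs, the same sed program can be fed from stdin. This is a sketch that assumes GNU sed (it relies on "\n" in the replacement and on D restarting the cycle without reading new input); the wrapper name count_aaa is hypothetical.

```shell
#!/bin/bash
# count_aaa: reads text on stdin and prints the number of overlapping "AAA" matches.
# Assumes GNU sed; -E is used here as the equivalent of -r.
count_aaa() {
    # Each pass rewrites the pattern space to "AAA\nAA<rest>", prints the first
    # line ("AAA"), then D deletes it and the cycle restarts on "AA<rest>".
    sed -E 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' | wc -l
}

printf 'AAAAA\n' | count_aaa        # prints 3
printf 'AAAABBAAA\n' | count_aaa    # prints 3 (2 in AAAA, 1 in AAA)
```

Dropping the trailing wc -l shows the intermediate output: one "AAA" line per match.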

Upvotes: 2

LMC

Reputation: 12672

This is the awk version

echo "AAAAA AAA AAAABBAAA"  \
| gawk -v pat="AAA" '{ 
    for(i=1; i<=NF; i++){
        # current field length
        m=length($i)
        #search pattern length
        n=length(pat)
        # stop at the last position where a full-length window still fits
        for(l=1 ; l<=m-n+1; l++){
            sstr=substr($i,l,n)
            #print i " " $i " sub:" sstr

            # substring matches pattern
            if(sstr ~ pat){
                count++
            }else{
                print "contiguous count on field " i " = " count
                # uncomment next line if non-contiguous matches are not needed
                #break
            }
        }
        print "total count on field " i " = " count
        count=0
    }

}'
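
If the pattern is meant to be a regular expression rather than a fixed string, a similar sliding search can be sketched with awk's match() and RSTART instead of fixed-width windows. The function name count_regex_overlaps is hypothetical, and the RLENGTH guard is an assumption added here so that patterns able to match the empty string do not loop forever.

```shell
#!/bin/bash
# count_regex_overlaps REGEX   (hypothetical helper; reads text on stdin)
# Counts matches of REGEX per start position, allowing overlaps.
count_regex_overlaps() {
    pat=$1 awk '
        BEGIN { pat = ENVIRON["pat"] }
        {
            s = $0
            # match() sets RSTART to the 1-based start of the leftmost match (0 if none)
            # and RLENGTH to its length; stop on no match or an empty match.
            while (pat != "" && match(s, pat) && RLENGTH > 0) {
                total++
                s = substr(s, RSTART + 1)   # step one character past the match start
            }
        }
        END { print total + 0 }
    '
}

echo "AAAAA AAA AAAABBAAA" | count_regex_overlaps 'AAA'    # prints 7
```

Unlike the per-field version above, this counts across the whole line, so the result is a single total.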

Upvotes: 1

John Moon

Reputation: 924

This isn't a trivial problem in bash. As far as I know, standard utils don't support this kind of searching. You can however use standard bash features to implement this behavior in a function. Here's how I would attack the problem, but there are other ways:

#!/bin/bash

search_term="AAA"
text=$(cat sample.txt)
term_len=${#search_term}
occurences=0

# While the remaining text is at least as long as the search term
while [ "${#text}" -ge "$term_len" ]; do

    # Look at just the length of the search term
    text_substr=${text:0:${term_len}}

    # If we see the search term, increment occurences
    if [ "$text_substr" = "$search_term" ]; then
        ((occurences++))
    fi

    # Remove the first character from the main text
    # (e.g. "AAAAA" becomes "AAAA")
    text=${text:1}
done

printf "%d occurrences of %s\n" "$occurences" "$search_term"
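
The same idea can be packaged as a reusable function that takes the term and the text as arguments. This is a sketch, not a drop-in replacement: the name count_occurrences is made up for illustration, and it walks an offset across the text rather than chopping off the first character each time, which is a variation chosen here.

```shell
#!/bin/bash
# count_occurrences TERM TEXT   (hypothetical helper name)
# Prints the number of times TERM occurs in TEXT, counting overlapping matches.
count_occurrences() {
    local term=$1 text=$2
    local term_len=${#term} count=0 i

    # An empty search term has no meaningful count.
    if [ "$term_len" -eq 0 ]; then
        echo 0
        return
    fi

    # Slide a window of term_len characters across the text, one offset at a time.
    for (( i = 0; i + term_len <= ${#text}; i++ )); do
        if [ "${text:i:term_len}" = "$term" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

count_occurrences AAA AAAAA                     # prints 3
count_occurrences AAA "$(printf 'AAAA\nAAA')"   # prints 3; matches never span the newline
```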

Upvotes: 1
