sharkbites

Reputation: 121

How to store each occurrence of a multiline string in an array using bash regex

Given a text file test.txt with contents:

hello
someline1
someline2
...
world1

line that shouldn't match

hello
someline1
someline2
...
world2

How can I store both of these multiline matches in separate array indexes?

I'm currently trying to use regex="hello.*world[12]"

Unfortunately I can only use native Bash, so Perl etc. is off the table. Thanks

Upvotes: 0

Views: 168

Answers (2)

Fravadona

Reputation: 17178

I would use awk and mapfile (bash version >= 4.4, which added the -d option to mapfile)

#!/bin/bash

mapfile -d '' arr < <(
    awk '/hello/{f=1} f; /world[12]/ && f {f=0; printf "\000"}' test.txt
)
declare -p arr    # inspect the result; it should print something like:
# declare -a arr=([0]=$'hello\nsomeline1\nsomeline2\n...\nworld1\n' [1]=$'hello\nsomeline1\nsomeline2\n...\nworld2\n')

notes:

  • awk '/hello/{f=1} f; /world[12]/ && f{f=0; printf "\000"}'
    . when encountering hello, set the flag to true
    . for each line, print it if the flag is true
    . when encountering world[12] and the flag is true, set the flag to false and print a null-byte delimiter

  • mapfile -d '' arr
    split the input into an array in which each element was delimited by a null-byte (instead of \n)


version for older bash:

#!/bin/bash
arr=()
while IFS='' read -r -d '' block
do
    arr+=( "$block" )
done < <(
    awk '/hello/{f=1} f; /world[12]/ && f{f=0; printf "\000"}' test.txt
)
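
The read loop can also be tried on its own, without awk. The following is a minimal sketch, assuming a hypothetical printf call that stands in for the awk output by emitting two NUL-terminated records. It shows that each element keeps its embedded and trailing newlines, and how the trailing one might be stripped with a parameter expansion.

```shell
#!/bin/bash
# Hypothetical sample input: printf stands in for the awk command and emits
# two NUL-terminated records, each spanning two lines.
arr=()
while IFS= read -r -d '' block; do
    arr+=( "$block" )
done < <(printf 'hello\nworld1\n\0hello\nworld2\n\0')

printf 'count=%d\n' "${#arr[@]}"

# Each record still ends with a newline; strip it before printing.
for block in "${arr[@]}"; do
    printf '[%s]\n' "${block%$'\n'}"
done
```

Note that read returns a non-zero status for a final record that is not terminated by a NUL byte, so such a record would not be appended by this loop. The awk command above always prints the delimiter after each block, so that case should not arise here.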

Upvotes: 1

tshiono

Reputation: 22032

Bash's regex matching has no equivalent of Python's findall(), so we need to capture the matched substrings one at a time in a loop.

Would you please try the following:

#!/bin/bash

str=$(<test.txt)
regex="hello.world[12]"

while [[ $str =~ ($regex)(.*) ]]; do
    ary+=( "${BASH_REMATCH[1]}" )       # store the match into an array
    str="${BASH_REMATCH[2]}"            # remaining substring
done

for i in "${!ary[@]}"; do               # see the result
    echo "[$i] ${ary[$i]}"
done

Output:

[0] hello
world1
[1] hello
world2
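
For comparison, here is a small sketch of what happens with the pattern from the question, hello.*world[12], when other lines sit between the blocks. The sample string is an assumption, inlined to stand in for test.txt. Because the match is greedy, a single match runs from the first "hello" to the last "world[12]" and swallows everything in between.

```shell
#!/bin/bash
# Sample data inlined for illustration (stands in for the contents of test.txt).
str=$'hello\nsomeline1\nworld1\n\nline that should not match\n\nhello\nsomeline2\nworld2'
regex='hello.*world[12]'

if [[ $str =~ $regex ]]; then
    # BASH_REMATCH[0] holds the whole matched text.
    match=${BASH_REMATCH[0]}
    printf '%s\n' "$match"
fi
```

With this input the match covers the entire string, including the line that was not supposed to match, so the loop above would store one element instead of two.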

[Edit]
If there are lines between "hello" and "world", we need to change the approach, because bash regex does not support non-greedy (shortest) matching. Then how about:

regex1="hello"
regex2="world"

while IFS= read -r line; do
    if [[ $line =~ $regex1 ]]; then
        str="$line"$'\n'
        f=1
    elif (( f )); then
        str+="$line"$'\n'
        if [[ $line =~ $regex2 ]]; then
            ary+=("$str")
            f=0
        fi
    fi
done < test.txt
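
To try the line-by-line approach without creating a file, here is a self-contained sketch. The here-document is an assumption that stands in for test.txt, regex2 is tightened to world[12] to match the question, and ary and f are initialized explicitly so that nothing is inherited from the environment.

```shell
#!/bin/bash
regex1="hello"
regex2="world[12]"
ary=()
f=0                                     # 1 while inside a block

while IFS= read -r line; do
    if [[ $line =~ $regex1 ]]; then
        str="$line"$'\n'                # start a new block
        f=1
    elif (( f )); then
        str+="$line"$'\n'               # accumulate lines of the block
        if [[ $line =~ $regex2 ]]; then
            ary+=("$str")               # block complete: store it
            f=0
        fi
    fi
done <<'EOF'
hello
someline1
someline2
world1

line that shouldn't match

hello
someline1
someline2
world2
EOF

for i in "${!ary[@]}"; do               # each element already ends with a newline
    printf '[%d] %s' "$i" "${ary[$i]}"
done
```

Each stored element keeps its line breaks, including the trailing one, and the line between the two blocks is skipped because the flag is 0 when it is read.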

Upvotes: 2
