anees al-najjar

Reputation: 17

Matching Regex groups using bash shell

I would like to match and extract several strings using regex capture groups in a bash script on Linux.

This works with the sed command when there are only a few capture groups. With a larger number of groups, however, it no longer works properly.

This is my code:

   txt="toknA: ABCDEFGGDSSSE  toknB 1500        SEABCDEFGGDSSSEABCDEFGGDSSSE  1235643 CDEFGGDSSSE       toknC 64  ABCDEFGGDSSSE        ABCDEFGGDSSSE  toknD 1000  ABCDEFGGDSSSE        toknE 14306  toknF 16402238        toknG 0  toknH 0  toknI 0  toknJ 0        toknK 4930  toknL 333494 toknM fdvd swsw"

echo $txt | sed -r 's/^(toknA).*(toknB \d+).*(toknC \d+).*(toknD \d+).*(toknE \d+).*(toknF).*(toknG).*(toknH).*(toknI).*(toknJ).*(toknK).*(toknL)/\1 \2 \3 \4 \5 \6 \7 \8 \9 \10 \11 \12/'

This is what I get:

toknA: ABCDEFGGDSSSE toknB 1500 SEABCDEFGGDSSSEABCDEFGGDSSSE 1235643 CDEFGGDSSSE toknC 64 ABCDEFGGDSSSE ABCDEFGGDSSSE toknD 1000 ABCDEFGGDSSSE toknE 14306 toknF 16402238 toknG 0 toknH 0 toknI 0 toknJ 0 toknK 4930 toknL 333494 toknM fdvd swsw

What I expected to get is:

toknA toknB 1500 toknC 64 toknD 1000 toknE 14306 toknF toknG toknH toknI toknJ toknK toknL

Any idea why this is happening? Can it be solved another way?

Upvotes: 1

Views: 60

Answers (2)

Ed Morton

Reputation: 203324

With GNU awk for the 3rd arg to match():

$ echo "$txt" | awk '
    match($0,/^(toknA).*(toknB [0-9]+).*(toknC [0-9]+).*(toknD [0-9]+).*(toknE [0-9]+).*(toknF).*(toknG).*(toknH).*(toknI).*(toknJ).*(toknK).*(toknL)/,a) {
        for (i=1; i in a; i++) {
            printf "%s%s", (i>1? OFS : ""), a[i]
        }
        print ""
    }'
toknA toknB 1500 toknC 64 toknD 1000 toknE 14306 toknF toknG toknH toknI toknJ toknK toknL
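
As background on why the sed command in the question likely printed the line unchanged, two things seem to be at play: `\d` is not a digit class in sed's regex syntax, and sed only has back-references `\1` through `\9`, so `\10` is read as `\1` followed by a literal `0`. The sketch below illustrates both on a short, made-up sample string (not the asker's full input); `-E` is used here as the equivalent of `-r` in GNU sed.

```shell
# Hypothetical two-token sample, not the asker's full input.
sample="toknA: xx toknB 1500 yy"

# \d is not a digit class in sed, so the pattern does not match
# and the line comes back unchanged.
unchanged=$(printf '%s\n' "$sample" | sed -E 's/^(toknA).*(toknB \d+).*/\1 \2/')
echo "$unchanged"

# [0-9] (or [[:digit:]]) matches digits as intended.
fixed=$(printf '%s\n' "$sample" | sed -E 's/^(toknA).*(toknB [0-9]+).*/\1 \2/')
echo "$fixed"    # toknA toknB 1500

# \10 is parsed as \1 followed by a literal 0, not as group 10.
tenth=$(printf '%s\n' "abc" | sed -E 's/(a)(b)(c)/\10/')
echo "$tenth"    # a0
```

This is why the awk and bash approaches, which store any number of groups in an array, scale past nine groups where a single sed substitution does not.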

Upvotes: 0

glenn jackman

Reputation: 246774

Using only bash regex matching, [[ a =~ b ]]: the captured groups are stored in the BASH_REMATCH array, with the whole match at index 0 and the groups from index 1 onward.

regex='(toknA)'
for x in {B..E}; do regex+=".*(tokn${x}[[:blank:]]+[[:digit:]]+)"; done
for x in {F..L}; do regex+=".*(tokn${x})"; done

if [[ $txt =~ $regex ]]; then
    for i in "${!BASH_REMATCH[@]}"; do
        printf "%d\t%q\n" "$i" "${BASH_REMATCH[i]}"
    done
    echo

    result=${BASH_REMATCH[*]:1}  # join into a single string
    echo "$result"
fi

outputs

0   toknA:\ ABCDEFGGDSSSE\ \ toknB\ 1500\ \ \ \ \ \ \ \ SEABCDEFGGDSSSEABCDEFGGDSSSE\ \ 1235643\ CDEFGGDSSSE\ \ \ \ \ \ \ toknC\ 64\ \ ABCDEFGGDSSSE\ \ \ \ \ \ \ \ ABCDEFGGDSSSE\ \ toknD\ 1000\ \ ABCDEFGGDSSSE\ \ \ \ \ \ \ \ toknE\ 14306\ \ toknF\ 16402238\ \ \ \ \ \ \ \ toknG\ 0\ \ toknH\ 0\ \ toknI\ 0\ \ toknJ\ 0\ \ \ \ \ \ \ \ toknK\ 4930\ \ toknL
1   toknA
2   toknB\ 1500
3   toknC\ 64
4   toknD\ 1000
5   toknE\ 14306
6   toknF
7   toknG
8   toknH
9   toknI
10  toknJ
11  toknK
12  toknL

toknA toknB 1500 toknC 64 toknD 1000 toknE 14306 toknF toknG toknH toknI toknJ toknK toknL
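
Because BASH_REMATCH is overwritten by the next `[[ ... =~ ... ]]` test, it can help to copy the groups into an array of your own before doing anything else. A minimal sketch, assuming bash and a shortened, made-up input line; the names `line`, `fields` and `joined` are illustrative only:

```shell
#!/usr/bin/env bash
# Hypothetical shortened input, not the asker's full string.
line="toknA: foo toknB 1500 bar toknC 64"
regex='(toknA).*(toknB[[:blank:]]+[[:digit:]]+).*(toknC[[:blank:]]+[[:digit:]]+)'

if [[ $line =~ $regex ]]; then
    # Copy groups 1..n; index 0 holds the whole match and is skipped.
    fields=("${BASH_REMATCH[@]:1}")
    # "${fields[*]}" joins the elements with the first character of IFS
    # (a space by default).
    joined="${fields[*]}"
    echo "${#fields[@]} groups: $joined"    # 3 groups: toknA toknB 1500 toknC 64
fi
```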

Upvotes: 1
