Theo Sweeny

Reputation: 1147

Grep PCRE Regex Non Capturing Groups

From the following text I wish to extract the following two strings:

ip-10-x-x-x.eu-west-2.compute.internal

And

topology.kubernetes.io/zone=eu-west-2a

Full blob:

ip-10-x-x-x.eu-west-2.compute.internal   Ready    <none>   18d   v1.20.4-eks-1-20-1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/os=linux,node.app/name=all,topology.kubernetes.io/region=eu-west-2,topology.kubernetes.io/zone=eu-west-2a

I am using grep with PCRE (-P) to extract the strings.

The following regex works on https://regex101.com/:

(((^ip.*?)(?=(\s)))(?:.*?)((?<=\,)(topology\.kubernetes\.io\/zone.*?)(?=(\s|$))))

But when I run it with grep in Bash v4.2, it returns the full blob rather than the regex groups, as seen here:

echo "ip-10-x-x-x.eu-west-2.compute.internal   Ready    <none>   18d   v1.20.4-eks-1-20-1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/os=linux,node.app/name=all,topology.kubernetes.io/region=eu-west-2,topology.kubernetes.io/zone=eu-west-2a" | grep -oP "(((^ip.*?)(?=(\s)))(?:.*?)((?<=\,)(topology\.kubernetes\.io\/zone.*?)(?=(\s|$))))"

What am I missing here?

Upvotes: 2

Views: 642

Answers (3)

The fourth bird

Reputation: 163352

Your pattern already captures the parts that you want in groups, but it is inefficient: it has 7 capture groups where you only need 2.

Several of the lookaround assertions are also unnecessary; they can be turned into a plain match or omitted altogether.

As already commented and explained, grep -o prints the whole match and cannot output individual capture groups. If GNU awk is available, you can use match with a third argument, which stores the groups in an array:

awk 'match($0, /^(ip\S+).*,(topology\.kubernetes\.io\/zone\S*)/, a) {
    print a[1]
    print a[2]
}' file

Output

ip-10-x-x-x.eu-west-2.compute.internal
topology.kubernetes.io/zone=eu-west-2a

Explanation about the pattern:

^(ip\S+).*,(topology\.kubernetes\.io\/zone\S*)
  • ^ Start of string
  • (ip\S+) Capture group 1, match ip and 1+ non whitespace chars
  • .* Match as much of the line as possible, then backtrack until the rest of the pattern matches
  • ,(topology\.kubernetes\.io\/zone\S*) Match the , and capture in group 2 the string topology.kubernetes.io/zone followed by optional non-whitespace chars (\S*)

If you want to output only those 2 matches on a single line, another option is to use sed and replace the whole line with the 2 capture groups:

sed -E 's/^(ip[^[:blank:]]+).*,(topology\.kubernetes\.io\/zone[^[:blank:]]*).*/\1 \2/' file

Output

ip-10-x-x-x.eu-west-2.compute.internal topology.kubernetes.io/zone=eu-west-2a
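
Since the question mentions Bash, the same two-group idea can also be sketched with Bash's own [[ ... =~ ... ]] operator, which exposes capture groups through the BASH_REMATCH array. This is a minimal sketch, assuming the script runs under Bash; the names line, re and extract_node_zone are made up for the illustration, and [^[:space:]] stands in for \S because the operator uses POSIX extended regex rather than PCRE.

```shell
#!/usr/bin/env bash
# Sample line from the question; variable and function names are
# hypothetical and exist only for this sketch.
line='ip-10-x-x-x.eu-west-2.compute.internal   Ready    <none>   18d   v1.20.4-eks-1-20-1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/os=linux,node.app/name=all,topology.kubernetes.io/region=eu-west-2,topology.kubernetes.io/zone=eu-west-2a'

# Keep the regex in a variable: a quoted right-hand side of =~ would be
# matched as a literal string instead of a regex.
re='^(ip[^[:space:]]+).*,(topology\.kubernetes\.io/zone[^[:space:]]*)'

extract_node_zone() {
    # After a successful match, BASH_REMATCH[1] and BASH_REMATCH[2]
    # hold the two capture groups.
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

extract_node_zone "$line"
```

The function returns a non-zero status when the line does not match, so a caller can skip such lines.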

Upvotes: 0

RavinderSingh13

Reputation: 133518

If you are OK with awk, try the following awk program (note that \S is a GNU awk extension).

awk '
match($0,/^ip\S+/){
  print substr($0,RSTART,RLENGTH)
  match($0,/,topology\.kubernetes\.io\/zone\S*/)
  print substr($0,RSTART+1,RLENGTH-1)
}
'  Input_file

Explanation: The first match call finds ^ip\S+, and substr prints the matched text using RSTART and RLENGTH. A second match call then finds ,topology\.kubernetes\.io\/zone\S*, and substr prints it with RSTART+1 and RLENGTH-1 so that the leading comma is skipped.
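
The two-step match approach can also be sketched for awk implementations that lack \S. This is only an illustration under assumptions: [^ \t] replaces \S (so only spaces and tabs count as separators), the second match is checked before printing so a line without a zone label prints just the node name, and the names line and extract_with_posix_awk are hypothetical.

```shell
# Sample line from the question; the variable and function names are
# made up for this sketch.
line='ip-10-x-x-x.eu-west-2.compute.internal   Ready    <none>   18d   v1.20.4-eks-1-20-1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/os=linux,node.app/name=all,topology.kubernetes.io/region=eu-west-2,topology.kubernetes.io/zone=eu-west-2a'

extract_with_posix_awk() {
    printf '%s\n' "$1" | awk '
    match($0, /^ip[^ \t]+/) {
        # RSTART/RLENGTH describe the node name found by the first match.
        print substr($0, RSTART, RLENGTH)
        # Print the zone label only when the second match succeeds;
        # +1 / -1 drop the leading comma from the matched text.
        if (match($0, /,topology\.kubernetes\.io\/zone[^ \t]*/))
            print substr($0, RSTART + 1, RLENGTH - 1)
    }'
}

extract_with_posix_awk "$line"
```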

Upvotes: 3

tshiono

Reputation: 22012

As Barmer comments, grep cannot output capture groups. You need to modify the regex so that each wanted string is a whole match on its own:

echo "ip-10-x-x-x.eu-west-2.compute.internal   Ready    <none>   18d   v1.20.4-eks-1-20-1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/os=linux,node.app/name=all,topology.kubernetes.io/region=eu-west-2,topology.kubernetes.io/zone=eu-west-2a" | grep -oP "^ip\S+|(?<=\,)topology\.kubernetes\.io\/zone\S*(?=(?:\s|$))"

Output:

ip-10-x-x-x.eu-west-2.compute.internal
topology.kubernetes.io/zone=eu-west-2a
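
The command above relies on -o printing every match on its own line, with the alternation supplying two separate matches. As a variation, the lookbehind can be written with \K, which drops everything matched so far from the reported match. This is a minimal sketch, assuming a GNU grep built with PCRE support (-P); the names line and extract_with_grep are made up for the illustration.

```shell
# Sample line from the question; the variable and function names are
# hypothetical and exist only for this sketch.
line='ip-10-x-x-x.eu-west-2.compute.internal   Ready    <none>   18d   v1.20.4-eks-1-20-1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/os=linux,node.app/name=all,topology.kubernetes.io/region=eu-west-2,topology.kubernetes.io/zone=eu-west-2a'

extract_with_grep() {
    # First alternative: the node name at the start of the line.
    # Second alternative: match the comma, then \K discards it so that
    # only the zone label is reported by -o.
    printf '%s\n' "$1" | grep -oP '^ip\S+|,\Ktopology\.kubernetes\.io/zone\S*'
}

extract_with_grep "$line"
```

Single quotes around the pattern keep the shell from interpreting the backslashes.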

If you want to use your regex as is, try ripgrep, which can print capture groups through its -r (replace) option:

echo "ip-10-x-x-x.eu-west-2.compute.internal   Ready    <none>   18d   v1.20.4-eks-1-20-1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/os=linux,node.app/name=all,topology.kubernetes.io/region=eu-west-2,topology.kubernetes.io/zone=eu-west-2a" | rg --pcre2 "(((^ip.*?)(?=(\s)))(?:.*?)((?<=\,)(topology\.kubernetes\.io\/zone.*?)(?=(\s|$))))" -r '$2'$'\n''$5'

which will produce the same results.

Upvotes: 5
