Ahmed Ghoneim

Reputation: 7067

Java Regex too greedy capturing groups

    Node0x7fd34984d728:s1 -> Node0x7fd34984d600:d0;
    Node0x7fd34984d850 [shape=record,shape=Mrecord,label="{Register %vreg13|0x7fd34984d850|{<d0>i32}}"];
    Node0x7fd34984d978 [shape=record,shape=Mrecord,label="{{<s0>0|<s1>1}|CopyFromReg [ORD=1]|0x7fd34984d978|{<d0>i32|<d1>ch}}"];
    Node0x7fd34984d978:s0 -> Node0x7fd3486095f0:d0[color=blue,style=dashed];
    Node0x7fd34984d978:s1 -> Node0x7fd34984d850:d0;
    Node0x7fd34984daa0 [shape=record,shape=Mrecord,label="{Register %vreg14|0x7fd34984daa0|{<d0>i32}}"];

I'm trying to capture only the nodes that contain the "ORD" keyword. My regex pattern is:

Node.+?label=\"\\{\\{(?<SRC><s[0-9]+?>[a-z0-9]+?)\\}|(?<NAME>.+?)\\[ORD=(?<ORD>[0-9]+?)\\]\\|(?<ID>[A-Za-z0-9]{14})|\\{(?<DEST><d[0-9]+?>[a-z0-9]+?)\\}\\}\"\\];

It's too greedy and captures the wrong groups.

The following snippet is captured as a single match:

Node0x7fd34984d728:s1 -> Node0x7fd34984d600:d0;
Node0x7fd34984d850 [shape=record,shape=Mrecord,label="{Register %vreg13|0x7fd34984d850|{<d0>i32}}"];
Node0x7fd34984d978 [shape=record,shape=Mrecord,label="{{<s0>0|<s1>1}|CopyFromReg [ORD=1]|0x7fd34984d978|{<d0>i32|<d1>ch}}"];

However, it should capture only:

Node0x7fd34984d978 [shape=record,shape=Mrecord,label="{{<s0>0|<s1>1}|CopyFromReg [ORD=1]|0x7fd34984d978|{<d0>i32|<d1>ch}}"];

as it is the only node that has the "ORD" keyword before the semicolon.

Upvotes: 3

Views: 106

Answers (2)

Wiktor Stribiżew

Reputation: 626754

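You need to get rid of the lazy and dot-matching patterns and replace them with negated character classes. A lazy quantifier such as .+? only prefers the shortest match; it still expands as far as necessary for the rest of the pattern to succeed, so it can run over the boundaries of the part you meant it to match. A negated character class cannot cross its delimiter, which prevents this "overflowing" between the parts of your substrings.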

String pattern = "Node[^\\]\\[]*\\[[^\\]\\[]*label=\"\\{\\{(?<SRC>[^{}]*)\\}\\|(?<NAME>\\w+)\\s*\\[ORD=(?<ORD>\\d+)\\]\\|(?<ID>[^|]*)\\|\\{(?<DEST>[^{}]*)\\}\\}\"\\];";

See demo
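
A minimal sketch of how this pattern might be used in Java, assuming the whole DOT text is available as one string. The class name OrdNodeMatcher, the method describeOrdNodes and the SAMPLE constant are made up for illustration; the regex is the one from this answer, with both opening braces after label=" escaped so that the string literal compiles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrdNodeMatcher {
    // Negated character classes ([^\]\[], [^{}], [^|]) keep each part of the
    // match inside its own delimiters, so a match cannot start at an earlier
    // "Node" and run across several statements.
    static final Pattern ORD_NODE = Pattern.compile(
        "Node[^\\]\\[]*\\[[^\\]\\[]*label=\"\\{\\{(?<SRC>[^{}]*)\\}\\|(?<NAME>\\w+)\\s*"
        + "\\[ORD=(?<ORD>\\d+)\\]\\|(?<ID>[^|]*)\\|\\{(?<DEST>[^{}]*)\\}\\}\"\\];");

    // Sample input copied from the question (hypothetical constant name).
    static final String SAMPLE = String.join("\n",
        "Node0x7fd34984d728:s1 -> Node0x7fd34984d600:d0;",
        "Node0x7fd34984d850 [shape=record,shape=Mrecord,label=\"{Register %vreg13|0x7fd34984d850|{<d0>i32}}\"];",
        "Node0x7fd34984d978 [shape=record,shape=Mrecord,label=\"{{<s0>0|<s1>1}|CopyFromReg [ORD=1]|0x7fd34984d978|{<d0>i32|<d1>ch}}\"];",
        "Node0x7fd34984d978:s0 -> Node0x7fd3486095f0:d0[color=blue,style=dashed];",
        "Node0x7fd34984d978:s1 -> Node0x7fd34984d850:d0;",
        "Node0x7fd34984daa0 [shape=record,shape=Mrecord,label=\"{Register %vreg14|0x7fd34984daa0|{<d0>i32}}\"];");

    // Returns one summary line per node definition that carries an ORD field.
    static List<String> describeOrdNodes(String dot) {
        List<String> out = new ArrayList<>();
        Matcher m = ORD_NODE.matcher(dot);
        while (m.find()) {
            out.add(m.group("NAME") + " ORD=" + m.group("ORD") + " ID=" + m.group("ID")
                + " SRC=" + m.group("SRC") + " DEST=" + m.group("DEST"));
        }
        return out;
    }

    public static void main(String[] args) {
        describeOrdNodes(SAMPLE).forEach(System.out::println);
    }
}
```

On the sample from the question, only the CopyFromReg node should be reported; the Register nodes and the edge statements are skipped because their labels do not start with a double brace or they have no label at all.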

Upvotes: 1

alpha bravo

Reputation: 7948

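I suggest using two simple patterns instead of one monster pattern to extract what you want.
Use this pattern first (in Java, compile it with the MULTILINE flag so that ^ matches at the start of every line; the braces are escaped because Java rejects an unescaped {):

^Node.*?label="\{(.*\bORD\b.*)\}".*?;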

This extracts the label of the only node that has the "ORD" keyword before the semicolon:
{<s0>0|<s1>1}|CopyFromReg [ORD=1]|0x7fd34984d978|{<d0>i32|<d1>ch}
Demo

Then apply this pattern to the extracted text:

(\{.+?\}|[^\|]+(?=\[ORD=\d+\])|[^\|]+)

to get the individual fields. They come back as successive numbered matches, not as named groups.
Demo
Results:

MATCH 1 {<s0>0|<s1>1}
MATCH 2 CopyFromReg
MATCH 3 [ORD=1]
MATCH 4 0x7fd34984d978
MATCH 5 {<d0>i32|<d1>ch}
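
A rough Java sketch of the two-step approach, under a few assumptions: the class name TwoStepOrdExtractor, the method extract and the SAMPLE constant are hypothetical, the input lines are not indented (otherwise ^Node would need to allow leading whitespace), and the braces are escaped because Java's regex engine does not accept a bare {. The outer capturing group of the second pattern is dropped here since the whole match is read with group().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepOrdExtractor {
    // Step 1: pick out the label body of every node line that contains ORD.
    // MULTILINE makes ^ match at the start of each line, not only at the start of the input.
    static final Pattern ORD_LINE = Pattern.compile(
        "^Node.*?label=\"\\{(.*\\bORD\\b.*)\\}\".*?;", Pattern.MULTILINE);

    // Step 2: split the label body into its fields.
    static final Pattern FIELD = Pattern.compile(
        "\\{.+?\\}|[^|]+(?=\\[ORD=\\d+\\])|[^|]+");

    // Sample input copied from the question (hypothetical constant name).
    static final String SAMPLE = String.join("\n",
        "Node0x7fd34984d728:s1 -> Node0x7fd34984d600:d0;",
        "Node0x7fd34984d850 [shape=record,shape=Mrecord,label=\"{Register %vreg13|0x7fd34984d850|{<d0>i32}}\"];",
        "Node0x7fd34984d978 [shape=record,shape=Mrecord,label=\"{{<s0>0|<s1>1}|CopyFromReg [ORD=1]|0x7fd34984d978|{<d0>i32|<d1>ch}}\"];",
        "Node0x7fd34984d978:s1 -> Node0x7fd34984d850:d0;");

    // Returns one list of fields per ORD node found in the input.
    static List<List<String>> extract(String dot) {
        List<List<String>> nodes = new ArrayList<>();
        Matcher line = ORD_LINE.matcher(dot);
        while (line.find()) {
            List<String> fields = new ArrayList<>();
            Matcher field = FIELD.matcher(line.group(1));
            while (field.find()) {
                // The name field matches with a trailing space ("CopyFromReg "), so trim it.
                fields.add(field.group().trim());
            }
            nodes.add(fields);
        }
        return nodes;
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach(System.out::println);
    }
}
```

Splitting the work this way keeps each pattern short, at the cost of positional fields: callers have to know that index 1 is the name and index 2 is the ORD tag.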

Upvotes: 1
