Reputation: 741

Returning multiple matches but only till first occurrence of a pattern using PHP and RegEX

I have a dataset that looks like this:

I(0,123...789){
A(0,567...999){.......n=Marc.....}
B(2,655...265){..................}
C(3,993...333){..................}
M(8,635...254){.................;}
}
O(0,345...789){
A(0,567...999){.......n=Marc.....}
B(2,876...775){..................}
C(3,993...549){..................}
M(8,354...987){.................;}
}
I(0,987...764){
A(0,567...999){.......n=Marc.....}
B(2,543...265){..................}
C(7,998...933){..................}
M(8,645...284){.................;}
}
B(0,123...789){
.......
}
I(0,987...764){
A(0,567...999){.......n=John.....}
B(2,543...265){..................}
C(7,998...933){..................}
M(8,645...284){.................;}
}

I am trying to return all I "sections", i.e. everything from the "I" up to the closing } that comes after the ;}, but only if the "I" section contains n=Marc.

So far I have come up with:

^([I]\(.*\){.*n=Marc.*^[M]\(.*;}.)}

https://regex101.com/r/VSuZh5/1

However, in some cases the data has a pattern like this:

I(0,123...789){
A(0,567...999){.......n=Marc.....}
B(2,655...265){..................}
C(3,993...333){..................}
M(8,635...254){.................;}
}
O(0,345...789){
A(0,567...999){.......n=Marc.....}
B(2,876...775){..................}
C(3,993...549){..................}
M(8,354...987){.................;}
}

In that case the regular expression returns both the I and the O section. Is there a way to make sure it always returns only the I section?

Apologies for the dataset; it's huge and contains a lot of sensitive data which I can't make public.

Upvotes: 2

Answers (3)

bobble bubble

Reputation: 18490

If the input were always formatted like the sample, I would rather split it into chunks at each closing } at the start of a line that is followed by a newline and an uppercase letter: ^}\R(?=[A-Z]).

Then use preg_grep to keep only the chunks that start with I and contain n=Marc.

$res = preg_grep('/^I.*n=Marc/s', preg_split('/^}\R(?=[A-Z])/m', $str));

See PHP demo at 3v4l.org

In your pattern, the .* can skip over section boundaries and undesired items, which results in unexpected matches.
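The split-then-filter idea above can be sketched as a self-contained script. The sample data is a shortened, reordered version of the question's dataset, and the variable names $chunks and $res are only illustrative. Note that the split consumes the closing } of every chunk except the last one.

```php
<?php
// Shortened sample data; the section bodies are placeholders from the question.
$str = <<<'TXT'
I(0,123...789){
A(0,567...999){.......n=Marc.....}
M(8,635...254){.................;}
}
O(0,345...789){
A(0,567...999){.......n=Marc.....}
M(8,354...987){.................;}
}
I(0,987...764){
A(0,567...999){.......n=John.....}
M(8,645...284){.................;}
}
TXT;

// Split at a "}" that sits at the start of a line and is followed by a
// newline plus an uppercase letter (the start of the next section).
$chunks = preg_split('/^}\R(?=[A-Z])/m', $str);

// Keep only the chunks that start with "I" and contain n=Marc anywhere.
// preg_grep preserves the original array keys.
$res = preg_grep('/^I.*n=Marc/s', $chunks);

foreach ($res as $i => $chunk) {
    // The section's closing "}" was consumed by the split.
    echo "Chunk $i:\n", $chunk, "\n";
}
```

Because each chunk holds exactly one section, the .* in the preg_grep pattern cannot run from an I section into a following O section.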

Upvotes: 2

The fourth bird

Reputation: 163277

One option might be to match I, then match all the lines that do not start with }, requiring at least one line that contains n=Marc:

^I\([^()]*\){(?:\R(?!}|.*n=Marc).*)*\R.*\bn=Marc\b.*(?:\R(?!}).*)*\R}$

Explanation

^ Start of line (the pattern relies on the multiline flag m)
I\([^()]*\){ Match I followed by (...){
(?: Non-capturing group
- \R(?!}|.*n=Marc) Match a Unicode newline sequence and assert that the next line neither starts with } nor contains n=Marc
- .* Match the rest of the line
)* Close the non-capturing group and repeat it 0+ times
\R Match a Unicode newline sequence
.*\bn=Marc\b.* Match a line that contains n=Marc between word boundaries
(?: Non-capturing group
- \R(?!}).* Match a newline sequence followed by a line that does not start with }
)* Close the non-capturing group and repeat it 0+ times
\R Match a newline sequence
} Match the closing }
$ End of line

Regex demo
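A minimal sketch of how the pattern could be used from PHP, assuming the m flag so that ^ and $ match at line boundaries. The sample data is a shortened, reordered version of the question's dataset, chosen so that an I section without n=Marc is directly followed by an O section that has it.

```php
<?php
// Shortened sample: I (John), then O (Marc), then I (Marc).
$str = <<<'TXT'
I(0,987...764){
A(0,567...999){.......n=John.....}
M(8,645...284){.................;}
}
O(0,345...789){
A(0,567...999){.......n=Marc.....}
M(8,354...987){.................;}
}
I(0,123...789){
A(0,567...999){.......n=Marc.....}
M(8,635...254){.................;}
}
TXT;

// The pattern from above, wrapped in delimiters with the multiline flag.
$re = '/^I\([^()]*\){(?:\R(?!}|.*n=Marc).*)*\R.*\bn=Marc\b.*(?:\R(?!}).*)*\R}$/m';

preg_match_all($re, $str, $matches);

// $matches[0] holds the full text of every matching I section.
foreach ($matches[0] as $section) {
    echo $section, "\n";
}
```

Since none of the repeated lines may start with }, a match cannot extend past the closing brace of the section it started in, so the first I section is not joined with the O section that follows it.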

Upvotes: 3

Emma

Reputation: 27723

My guess is that we want an expression to return the O section that has n=Marc in it, something similar to:

(?=O\()([\s\S]*?n=Marc[\s\S]*?;}\s*})

Or maybe:

(?=O\()([\s\S]*?n=Marc[\s\S]*?;})\s*}

Demo 1

For I sections we'd simply change O to I:

(?=I\()([\s\S]*?n=Marc[\s\S]*?;})\s*}

Demo 2

Test

$re = '/(?=I\()([\s\S]*?n=Marc[\s\S]*?;})\s*}/m';
$str = 'I(0,123...789){
A(0,567...999){.......n=Marc.....}
B(2,655...265){..................}
C(3,993...333){..................}
M(8,635...254){.................;}
}
O(0,345...789){
A(0,567...999){.......n=Marc.....}
B(2,876...775){..................}
C(3,993...549){..................}
M(8,354...987){.................;}
}
I(0,987...764){
A(0,567...999){.......n=Marc.....}
B(2,543...265){..................}
C(7,998...933){..................}
M(8,645...284){.................;}
}
B(0,123...789){
.......
}
I(0,987...764){
A(0,567...999){.......n=John.....}
B(2,543...265){..................}
C(7,998...933){..................}
M(8,645...284){.................;}
}';

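// Collect all matches; with PREG_SET_ORDER each entry is one match,
// and index 0 of that entry is the full matched text.
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

foreach ($matches as $I) {
    echo $I[0] . "\n";
}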

Output

I(0,123...789){
A(0,567...999){.......n=Marc.....}
B(2,655...265){..................}
C(3,993...333){..................}
M(8,635...254){.................;}
}
I(0,987...764){
A(0,567...999){.......n=Marc.....}
B(2,543...265){..................}
C(7,998...933){..................}
M(8,645...284){.................;}
}

Upvotes: 0