Matching text that is not within the curly brackets, while also capturing the brackets after

Question

My situation requires recursion, and I'm able to match what's in the curly brackets already the way I need it, but I'm unable to capture the surrounding text.

So this would be the example text:

This is foo {{foo}} and {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}} more_text {{foo

And I need my result to look like this:

0       =>      This is foo 
1       =>      {{foo}}
2       =>       and 
3       =>      {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}}
4       =>       more_text {{foo

With this: (\{\{([^{{}}]|(?R))*\}\}) I have been able to match {{foo}} and {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}} very nicely, but not the surrounding text to achieve the result that I need.

I have tried many things, but without success.

Wiktor Stribiżew · Accepted Answer

You may use the following solution based on preg_split and the PREG_SPLIT_DELIM_CAPTURE flag:

$re = '/({{(?:[^{}]++|(?R))*}})/';
$str = 'This is foo {{foo}} and {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}} more_text {{foo';
$res = preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($res);
// => Array
// (
//     [0] => This is foo 
//     [1] => {{foo}}
//     [2] =>  and 
//     [3] => {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}}
//     [4] =>  more_text {{foo
// )

See the PHP demo.

The whole pattern is wrapped in an outer capturing group. That is why, with PREG_SPLIT_DELIM_CAPTURE, the text that is split upon is also added to the output array.

If there are unwanted empty elements, the PREG_SPLIT_NO_EMPTY flag discards them.
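
To make the effect of the flag concrete, here is a minimal sketch on a shortened input. The string 'a {{b}} c' and the variable names are illustrative and not taken from the question.

```php
<?php
// Same pattern as above; the outer parentheses form Group 1.
$re  = '/({{(?:[^{}]++|(?R))*}})/';
$str = 'a {{b}} c'; // shortened, illustrative input

// Without flags, the matched {{...}} delimiter is dropped from the result.
$plain = preg_split($re, $str);

// With PREG_SPLIT_DELIM_CAPTURE, the text captured by Group 1 is kept.
$withDelims = preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE);

echo json_encode($plain), "\n";      // ["a "," c"]
echo json_encode($withDelims), "\n"; // ["a ","{{b}}"," c"]
```

Without the outer group there would be nothing for PREG_SPLIT_DELIM_CAPTURE to return, so the parentheses around the whole pattern are what keep the {{...}} blocks in the array.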

More details:

Pattern: I removed the unnecessary escapes and symbols from your pattern. In a PHP regex you do not have to escape { when the context is enough for the regex engine to deduce that it is a literal (and you never need to escape }). Note that [{}] is the same as [{{}}]: both match a single character that is either { or }, no matter how many { and } you put into the character class. I also improved performance by matching each run of non-brace characters with the possessive quantifier ++ instead of one character per iteration.
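
The two equivalences claimed above can be checked with a small sketch. The sample strings and variable names are illustrative assumptions.

```php
<?php
$sample = 'a{b}c'; // illustrative input

// Repeating a character inside a character class does not change what it matches.
preg_match_all('/[{}]/', $sample, $single);
preg_match_all('/[{{}}]/', $sample, $doubled);
var_dump($single[0] === $doubled[0]); // bool(true)

// The escaped and unescaped spellings match the same text here.
$escaped   = preg_match('/\{\{foo\}\}/', 'x {{foo}} y', $m1);
$unescaped = preg_match('/{{foo}}/', 'x {{foo}} y', $m2);
var_dump($m1[0] === $m2[0]); // bool(true)
```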

Details:

- ( - Group 1 start:
  - {{ - 2 consecutive {s
  - (?:[^{}]++|(?R))* - 0 or more sequences of:
    - [^{}]++ - 1 or more symbols other than { and } (no backtracking into this pattern is allowed)
    - | - or
    - (?R) - recurse the whole pattern, that is, match a nested {{...}} block
  - }} - a }} substring
- ) - Group 1 end.
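
As a rough illustration of how the recursion behaves, the sketch below runs the same pattern through preg_match_all on a short, made-up input containing one nested block and one unclosed {{.

```php
<?php
$re  = '/({{(?:[^{}]++|(?R))*}})/';
$str = 'x {{a({{b}})}} y {{c'; // illustrative input: one nested block, one unclosed {{

preg_match_all($re, $str, $matches);

// Only the balanced, outermost block is reported: the inner {{b}} is consumed
// by the (?R) recursion, and the unclosed {{c never finds its closing }}.
echo json_encode($matches[1]), "\n"; // ["{{a({{b}})}}"]
```

This is why the trailing {{foo in the question ends up as part of the ordinary text rather than as a separate token.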

PHP part:

When tokenizing a string using just one token type, it is easy to use a splitting approach. Since preg_split in PHP can split on a regex while keeping the text that is matched, it is ideal for this kind of task.

The only trouble is that empty entries might creep into the resulting array when matches are adjacent to each other or sit at the start or end of the string. Thus, PREG_SPLIT_NO_EMPTY is good to use here.
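
A minimal sketch of that situation, assuming an illustrative input where two blocks are adjacent and the first one starts the string:

```php
<?php
$re  = '/({{(?:[^{}]++|(?R))*}})/';
$str = '{{a}}{{b}} tail'; // illustrative input: match at the start, then an adjacent match

// Delimiters kept, empty pieces kept as well.
$withEmpty = preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE);

// Same split, but empty pieces are discarded.
$noEmpty = preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

echo json_encode($withEmpty), "\n"; // ["","{{a}}","","{{b}}"," tail"]
echo json_encode($noEmpty), "\n";   // ["{{a}}","{{b}}"," tail"]
```

The empty strings in the first result are the zero-length pieces of text before {{a}} and between the two blocks.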
