aborted

Reputation: 4541

Matching text that is not within the curly brackets, while also capturing the brackets after

My situation requires recursion, and I'm able to match what's in the curly brackets already the way I need it, but I'm unable to capture the surrounding text.

So this would be the example text:

This is foo {{foo}} and {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}} more_text {{foo

And I need my result to look like this:

0       =>      This is foo 
1       =>      {{foo}}
2       =>       and 
3       =>      {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}}
4       =>       more_text {{foo

With this: (\{\{([^{{}}]|(?R))*\}\}) I have been able to match {{foo}} and {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}} very nicely, but not the surrounding text to achieve the result that I need.

I have tried many things, but without success.

Upvotes: 1

Views: 47

Answers (2)

Wiktor Stribiżew

Reputation: 627507

You may use the following solution based on preg_split with the PREG_SPLIT_DELIM_CAPTURE flag:

$re = '/({{(?:[^{}]++|(?R))*}})/';
$str = 'This is foo {{foo}} and {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}} more_text {{foo';
$res = preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($res);
// => Array
(
    [0] => This is foo 
    [1] => {{foo}}
    [2] =>  and 
    [3] => {{bar.function({{demo.funtion({{inner}} == "demo")}} and {{bar}} or "foo")}}
    [4] =>  more_text {{foo
)

See the PHP demo.

The whole pattern is wrapped in an outer capturing group. That is why, with PREG_SPLIT_DELIM_CAPTURE, the text that is split upon is also added to the output array.

If there are unwanted empty elements, PREG_SPLIT_NO_EMPTY flag will discard them.

More details:

Pattern: I removed unnecessary escapes and symbols from your pattern. You do not have to escape { in a PHP regex when the context is enough for the regex engine to deduce that it is a literal (and you do not need to escape } in any context). Note that [{}] is the same as [{{}}]: both match a single char that is either a { or }, no matter how many { and } you put into the character class. I also improved performance by matching runs of non-brace characters with the possessive quantifier ++ instead of one character per iteration.

Details:

  • ( - Group 1 start:
    • {{ - 2 consecutive {s
    • (?:[^{}]++|(?R))* - 0 or more sequences of:
      • [^{}]++ - 1 or more symbols other than { and } (no backtracking into this pattern is allowed)
      • | - or
      • (?R) - try matching the whole pattern
    • }} - a }} substring
  • ) - Group 1 end.
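
As a quick check of the breakdown above, here is a minimal sketch that runs the same pattern through preg_match_all. The sample string and variable names are my own, not from the question. Only balanced {{...}} blocks should match, and the unterminated {{z at the end should be skipped.

```php
<?php
// Same recursive pattern as in the answer.
$re = '/({{(?:[^{}]++|(?R))*}})/';

// Hypothetical sample: one simple tag, one nested tag, one unterminated tag.
$str = 'a {{x}} b {{f({{y}} == 1)}} c {{z';

preg_match_all($re, $str, $m);
print_r($m[0]);
// [0] => {{x}}
// [1] => {{f({{y}} == 1)}}
```

The nested {{y}} is consumed by the (?R) recursion, so it does not show up as a separate top-level match.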

PHP part:

When tokenizing a string using just one token type, it is easy to use a splitting approach. Since preg_split in PHP can split on a regex while keeping the text that is matched, it is ideal for this kind of task.

The only trouble is that empty entries can creep into the resulting array when matches are consecutive or sit at the start or end of the string. Thus, PREG_SPLIT_NO_EMPTY is good to use here.
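
To illustrate the empty-entry issue, this small sketch splits a made-up string where two tags are adjacent and the first one sits at the very start. The input and the variable names $withEmpty and $noEmpty are assumptions for the example.

```php
<?php
$re = '/({{(?:[^{}]++|(?R))*}})/';

// Hypothetical input: two tags back to back at the start of the string.
$str = '{{a}}{{b}} tail';

// Without PREG_SPLIT_NO_EMPTY, the empty strings around the delimiters are kept.
$withEmpty = preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE);

// With it, only the non-empty pieces remain.
$noEmpty = preg_split($re, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

var_dump($withEmpty); // '', '{{a}}', '', '{{b}}', ' tail'
var_dump($noEmpty);   // '{{a}}', '{{b}}', ' tail'
```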

Upvotes: 1

ArtisticPhoenix

Reputation: 21681

I would use a pattern like this

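// Note: body is tried before operators, so and/or are captured as body
// unless the operators group is moved ahead of it.
$patt = '/(?P<open>\{\{)|(?P<body>[-0-9a-zA-Z._]+)|(?P<whitespace>\s+)|(?P<operators>and|or|==)|(?P<close>\}\})/';

// $text holds the template string to tokenize.
preg_match_all( $patt, $text, $matches );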

The output is far too long to show in full, but you can loop over it and match items up; basically, it is tokenizing the string.

It looks like this:

array (
0 => 
    array (
        0 => '{{',
        1 => 'bar.function',
        2 => '{{',
        3 => 'demo.funtion',
        4 => '{{',
        5 => 'inner',
        6 => '}}',
        7 => ' ',
        8 => '==',
        9 => ' ',
        10 => 'demo',
        11 => '}}',
        12 => ' ',
        13 => 'and',
        14 => ' ',
        15 => '{{',
        16 => 'bar',
        17 => '}}',
        18 => ' ',
        19 => 'or',
        20 => ' ',
        21 => 'foo',
        22 => '}}',
    ),
'open' => 
    array (
        0 => '{{',
        1 => '',
        2 => '{{',
        3 => '',
        4 => '{{',
        5 => '',
        6 => '',
        7 => '',
        8 => '',
        9 => '',
        10 => '',
        11 => '',
        12 => '',
        13 => '',
        14 => '',
        15 => '{{',
        16 => '',
        17 => '',
        18 => '',
        19 => '',
        20 => '',
        21 => '',
        22 => '',
    ),
'body' => 
    array (
        0 => '',
        1 => 'bar.function',
        2 => '',
        3 => 'demo.funtion',
        4 => '',
        5 => 'inner',
        6 => '',
        ....
   )
 )

Then, in a loop, you can tell that match [0][0] is an open tag, match [0][1] is a body, match [0][2] is another open tag, etc. By keeping track of open and close tags you can work out the nesting. The named groups tell you whether each match is an open, body, close, or operator match, and so on.
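
The loop described above could look roughly like this. It is only a sketch under my own assumptions: the sample $text, the $lines and $depth variables, and the reordered pattern (operators with word boundaries placed before body, so that and/or are not swallowed as body text) are not part of the original pattern. PREG_SET_ORDER is used so that each token arrives as one array.

```php
<?php
// Token pattern along the lines of the one above; group order is an assumption.
$patt = '/(?P<open>\{\{)|(?P<close>\}\})|(?P<operator>\b(?:and|or)\b|==)'
      . '|(?P<body>[-0-9a-zA-Z._]+)|(?P<whitespace>\s+)/';

// Hypothetical input.
$text = '{{bar.function({{inner}} and {{bar}})}}';

preg_match_all($patt, $text, $matches, PREG_SET_ORDER);

$depth = 0;
$lines = [];
foreach ($matches as $m) {
    if (($m['open'] ?? '') !== '') {
        // Record the open tag at the current depth, then go one level deeper.
        $lines[] = str_repeat('  ', $depth) . $m['open'];
        $depth++;
    } elseif (($m['close'] ?? '') !== '') {
        // Step back out before recording the close tag.
        $depth--;
        $lines[] = str_repeat('  ', $depth) . $m['close'];
    } elseif (($m['whitespace'] ?? '') === '') {
        // Anything else that is not whitespace: body or operator.
        $lines[] = str_repeat('  ', $depth) . $m[0];
    }
}

echo implode("\n", $lines), "\n";
// If $depth is not 0 here, the tags were unbalanced.
```

Characters the pattern does not cover, such as ( and ), are simply skipped by preg_match_all, so they do not appear in the token list.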

That is everything you need; I don't have time for a full workup of a solution...

A quick example: an open followed by a body followed by a close is a variable, and an open followed by a body and another open is a function. You can also add additional patterns by inserting something like (?P<function>function\.) with a pipe, as in '/(?P<open>\{\{)|(?P<function>function\.)|... . Then you could pick up keywords like function, foreach, block, etc.

I've written full-fledged template systems with this method. In my template system I build the regex from an array like this:

  [ 'open' => '\{\{', 'function' => 'function\.', .... ]

And then combine it into the actual regex, which makes life easy:

  $r = [];
  foreach( $patt_array as $key=>$value ){
     $r[] = '(?P<'.$key.'>'.$value.')';
  }

  $patt = '/'.implode('|', $r ).'/';
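
For a runnable version of the idea, here is a sketch that builds the pattern from an array and labels every token with the name of the group that matched. The token table, the tokenize() helper, and the sample input are hypothetical and only meant to show the shape of the approach.

```php
<?php
// Hypothetical token table; order matters because earlier alternatives win.
$patt_array = [
    'open'       => '\{\{',
    'close'      => '\}\}',
    'operator'   => '\b(?:and|or)\b|==',
    'body'       => '[-0-9a-zA-Z._]+',
    'whitespace' => '\s+',
];

// Wrap each entry in a named group and join the groups with a pipe.
$r = [];
foreach ($patt_array as $key => $value) {
    $r[] = '(?P<' . $key . '>' . $value . ')';
}
$patt = '/' . implode('|', $r) . '/';

// Return [type, text] pairs, where type is the named group that matched.
function tokenize(string $patt, array $names, string $text): array
{
    preg_match_all($patt, $text, $matches, PREG_SET_ORDER);
    $tokens = [];
    foreach ($matches as $m) {
        foreach ($names as $name) {
            if (($m[$name] ?? '') !== '') {
                $tokens[] = [$name, $m[$name]];
                break;
            }
        }
    }
    return $tokens;
}

$tokens = tokenize($patt, array_keys($patt_array), '{{foo}} or {{bar}}');
foreach ($tokens as [$type, $value]) {
    echo $type, ': ', var_export($value, true), "\n";
}
```

Keeping the table as an array means a new token type is one extra entry, but its position still decides which alternative wins when two of them could match the same text.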

Etc...

If you follow.

Upvotes: 1
