AymDev

Reputation: 7504

PHP RegEx to match nested patterns (possible recursion)

I am trying to match a pattern which may be nested.
Here is some example data from which I want to extract the content inside the {{ loop ... }} element:

<ul>
    {{ loop #users as #u }}
        <li>{{ #u.first_name }} {{ #u.last_name }}</li>
    {{ endloop }}
</ul>

This RegEx extracts it correctly:

/{{\s+loop\s+#([a-zA-Z_][a-zA-Z0-9_]*)((?:\.[a-zA-Z0-9_]+)*)\s+as\s+#([a-zA-Z_][a-zA-Z0-9_]*)\s+}}(.*){{\s+endloop\s+}}/sU

Explanation:

  • /
  • {{ start of open loop element
    • \s+loop\s+ loop keyword
    • #([a-zA-Z_][a-zA-Z0-9_]*) a variable name (ex: #var)
    • ((?:\.[a-zA-Z0-9_]+)*) optional variable key (ex: #var.key)
    • \s+as\s+ as keyword
    • #([a-zA-Z_][a-zA-Z0-9_]*)\s+ alias variable name (ex: #alias)
  • }} end of open loop element
  • (.*) the loop content
  • {{\s+endloop\s+}} close loop element
  • /sU
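To make the capture groups concrete, here is a minimal PHP sketch that runs the pattern above against the sample template. The variable names ($template, $pattern, $m) are only illustrative and not part of the original project.

```php
<?php
// The sample template from above.
$template = <<<'TPL'
<ul>
    {{ loop #users as #u }}
        <li>{{ #u.first_name }} {{ #u.last_name }}</li>
    {{ endloop }}
</ul>
TPL;

// The pattern exactly as written above (s: dot matches newlines, U: ungreedy).
$pattern = '/{{\s+loop\s+#([a-zA-Z_][a-zA-Z0-9_]*)((?:\.[a-zA-Z0-9_]+)*)\s+as\s+#([a-zA-Z_][a-zA-Z0-9_]*)\s+}}(.*){{\s+endloop\s+}}/sU';

if (preg_match($pattern, $template, $m)) {
    echo 'variable: ', $m[1], PHP_EOL;       // group 1: variable name
    echo 'keys:     ', $m[2], PHP_EOL;       // group 2: optional ".key" chain (empty here)
    echo 'alias:    ', $m[3], PHP_EOL;       // group 3: alias name
    echo 'content:  ', trim($m[4]), PHP_EOL; // group 4: loop body
}
```

Group 4 is the part that is handed on to the recursive parsing step.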

Where it fails

With nested loops, I need to get the content of the first level loop (because content is then parsed recursively in my project). Here is some example data:

 1| <ul>
 2|     {{ loop #users as #u }}
 3|         <li>
 4|             {{ #u.first_name }} {{ #u.last_name }}
 5|             <ul>
 6|                 {{ loop #u.friends as #f }}
 7|                     <li>{{ #f.first_name }} {{ #f.last_name }}</li>
 8|                 {{ endloop }}
 9|             </ul>
10|         </li>
11|     {{ endloop }}
12| </ul>
13| 
14| {{ loop #foo as #bar }}
15|     <a href="#">{{ #bar }}</a>
16| {{ endloop }}

With this content, the pattern stops at the first {{ endloop }} it encounters (lines 2-8).
If I remove the U flag (ungreedy), I can't use multiple loops, because the match runs to the last {{ endloop }} even when it belongs to a different loop (lines 2-16).
A previous version of the pattern used the /m flag (multiline), but it failed too, as it only matched the deepest loop (lines 6-8).
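The first two failure modes can be reproduced with a short PHP sketch. The variable names ($body, $lazy, $greedy and the three flags) are illustrative only, and the template is the nested sample without its line numbers.

```php
<?php
$template = <<<'TPL'
<ul>
    {{ loop #users as #u }}
        <li>
            {{ #u.first_name }} {{ #u.last_name }}
            <ul>
                {{ loop #u.friends as #f }}
                    <li>{{ #f.first_name }} {{ #f.last_name }}</li>
                {{ endloop }}
            </ul>
        </li>
    {{ endloop }}
</ul>

{{ loop #foo as #bar }}
    <a href="#">{{ #bar }}</a>
{{ endloop }}
TPL;

// Same pattern body as above; only the flags differ between the two runs.
$body = '{{\s+loop\s+#([a-zA-Z_][a-zA-Z0-9_]*)((?:\.[a-zA-Z0-9_]+)*)\s+as\s+#([a-zA-Z_][a-zA-Z0-9_]*)\s+}}(.*){{\s+endloop\s+}}';

preg_match('/' . $body . '/sU', $template, $lazy);   // ungreedy
preg_match('/' . $body . '/s', $template, $greedy);  // greedy

// Ungreedy: the body opens the inner loop but ends before any endloop tag.
$lazyOpensInnerLoop  = strpos($lazy[4], '{{ loop #u.friends as #f }}') !== false;
$lazyHasNoEndloop    = strpos($lazy[4], 'endloop') === false;
// Greedy: the body runs on into the unrelated #foo loop.
$greedySwallowsFoo   = strpos($greedy[4], '{{ loop #foo as #bar }}') !== false;

var_dump($lazyOpensInnerLoop, $lazyHasNoEndloop, $greedySwallowsFoo);
```

Neither quantifier mode can pair each opening tag with its own closing tag, which is why a plain (.*) is not enough here.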

I made many attempts (mostly on regexr.com) without making any progress. I searched for solutions involving "recursive patterns"; the best I found was this question, but after many attempts I could not adapt it to my project.


I am not only looking for the solution; I would really like to understand how to solve this. Link to current RegexR: regexr.com/426fd

Upvotes: 0

Views: 343

Answers (2)

revo

Reputation: 48711

Here is a fix written with performance in mind (it takes a few hundred steps instead of thousands of backtracking steps):

{{\s+loop\s+#(\w+)[^#]*#(\w+)\s*}}(?:[^{]*+|(?R)|{+)*{{\s+endloop\s+}}

See live demo here

RegExp breakdown:

  • {{\s+loop\s+#(\w+)[^#]*#(\w+)\s*}} Match a starting loop structure and capture hashed words
  • (?: Start of non-capturing group
    • [^{]*+ Match anything but a { possessively
    • | Or
    • (?R) Recurse into the whole pattern
    • | Or
    • {+ Match one or more opening braces
  • )* Repeat the group zero or more times
  • {{\s+endloop\s+}} Match an ending structure
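As a rough illustration of the breakdown above, the following PHP sketch applies the pattern with preg_match_all to the nested sample from the question. The ~ delimiter and the variable names are my own choices for the example. Note that this pattern captures only the two hashed words (so a top-level #u.friends would be captured as u) and does not capture the loop body; the whole block is available as match 0.

```php
<?php
$template = <<<'TPL'
<ul>
    {{ loop #users as #u }}
        <li>
            {{ #u.first_name }} {{ #u.last_name }}
            <ul>
                {{ loop #u.friends as #f }}
                    <li>{{ #f.first_name }} {{ #f.last_name }}</li>
                {{ endloop }}
            </ul>
        </li>
    {{ endloop }}
</ul>

{{ loop #foo as #bar }}
    <a href="#">{{ #bar }}</a>
{{ endloop }}
TPL;

// Pattern from this answer; no s flag is needed because it never uses the dot.
$pattern = '~{{\s+loop\s+#(\w+)[^#]*#(\w+)\s*}}(?:[^{]*+|(?R)|{+)*{{\s+endloop\s+}}~';

preg_match_all($pattern, $template, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[0] is the full top-level block, including any nested loops.
    printf(
        "#%s as #%s spans %d endloop tag(s)\n",
        $m[1],
        $m[2],
        substr_count($m[0], 'endloop')
    );
}
```

Only top-level loops are returned; the nested loop is consumed by the (?R) call inside the first match.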

Upvotes: 2

Wiktor Stribiżew

Reputation: 626738

Here is a quick fix of your current pattern:

{{\s+loop\s+#([a-zA-Z_]\w*)((?:\.\w+)*)\s+as\s+#([a-zA-Z_]\w*)\s*}}((?:(?!{{\s+(?:end)?loop\s).|(?R))*){{\s+endloop\s+}}

Note that you do not need the U modifier for this pattern to work as expected, but you still need the s modifier so that . matches any character, including line breaks.

See the regex demo

The main difference is the replacement of .* with (?:(?!{{\s+(?:end)?loop\s).|(?R))*. It matches 0 or more repetitions of:

  • (?!{{\s+(?:end)?loop\s). - any char (.) that is not starting a sequence meeting the following pattern:
    • {{ - a {{ substring
    • \s+ - 1+ whitespaces
    • (?:end)? - an optional end substring
    • loop - a loop substring
    • \s - a whitespace
  • | - or
  • (?R) - the whole regex pattern

Besides, [a-zA-Z0-9_] is equivalent to \w as long as you are not using the u modifier or the (*UCP) PCRE verb, hence the whole pattern can be shortened a bit.
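Since the question mentions that the loop content is parsed recursively afterwards, here is a minimal sketch of how the pattern from this answer might be used that way in PHP. LOOP_PATTERN and collectLoops() are hypothetical names invented for this example, and the ~ delimiter is an arbitrary choice; the helper only lists the loops it finds instead of rendering anything.

```php
<?php
// Pattern from this answer; s lets the dot match line breaks.
const LOOP_PATTERN = '~{{\s+loop\s+#([a-zA-Z_]\w*)((?:\.\w+)*)\s+as\s+#([a-zA-Z_]\w*)\s*}}((?:(?!{{\s+(?:end)?loop\s).|(?R))*){{\s+endloop\s+}}~s';

// Hypothetical helper: returns every loop as [depth, variable, alias].
function collectLoops(string $template, int $depth = 0): array
{
    $found = [];
    preg_match_all(LOOP_PATTERN, $template, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Groups 1 and 2 together form the variable (name plus ".key" chain).
        $found[] = [$depth, $m[1] . $m[2], $m[3]];
        // Group 4 is the loop body; nested loops are still inside it.
        foreach (collectLoops($m[4], $depth + 1) as $inner) {
            $found[] = $inner;
        }
    }
    return $found;
}

$template = <<<'TPL'
<ul>
    {{ loop #users as #u }}
        <li>
            {{ #u.first_name }} {{ #u.last_name }}
            <ul>
                {{ loop #u.friends as #f }}
                    <li>{{ #f.first_name }} {{ #f.last_name }}</li>
                {{ endloop }}
            </ul>
        </li>
    {{ endloop }}
</ul>

{{ loop #foo as #bar }}
    <a href="#">{{ #bar }}</a>
{{ endloop }}
TPL;

foreach (collectLoops($template) as [$depth, $variable, $alias]) {
    echo str_repeat('  ', $depth), "#$variable as #$alias", PHP_EOL;
}
```

The regex only has to find balanced top-level loops; the nesting itself is handled by calling the same function on group 4.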

Upvotes: 1
