Reputation: 7504
I am trying to match a pattern which may be nested.
Here is some example data where I want to extract the content inside the {{ loop ... }} element:
<ul>
{{ loop #users as #u }}
<li>{{ #u.first_name }} {{ #u.last_name }}</li>
{{ endloop }}
</ul>
I get it correctly with this RegEx:
/{{\s+loop\s+#([a-zA-Z_][a-zA-Z0-9_]*)((?:\.[a-zA-Z0-9_]+)*)\s+as\s+#([a-zA-Z_][a-zA-Z0-9_]*)\s+}}(.*){{\s+endloop\s+}}/sU
Explanation:
/
{{ - start of open loop element
\s+loop\s+ - loop keyword
#([a-zA-Z_][a-zA-Z0-9_]*) - a variable name (ex: #var)
((?:\.[a-zA-Z0-9_]+)*) - optional variable key (ex: #var.key)
\s+as\s+ - as keyword
#([a-zA-Z_][a-zA-Z0-9_]*)\s+ - alias variable name (ex: #alias)
}} - end of open loop element
(.*) - the loop content
{{\s+endloop\s+}} - close loop element
/sU
With nested loops, I need to get the content of the first level loop (because content is then parsed recursively in my project). Here is some example data:
1| <ul>
2| {{ loop #users as #u }}
3| <li>
4| {{ #u.first_name }} {{ #u.last_name }}
5| <ul>
6| {{ loop #u.friends as #f }}
7| <li>{{ #f.first_name }} {{ #f.last_name }}</li>
8| {{ endloop }}
9| </ul>
10| </li>
11| {{ endloop }}
12| </ul>
13|
14| {{ loop #foo as #bar }}
15| <a href="#">{{ #bar }}</a>
16| {{ endloop }}
With this content, the pattern stops at the first {{ endloop }} encountered (lines 2-8).
If I remove the U flag (ungreedy), I can't handle multiple loops, because the match then runs to the last {{ endloop }} even when the loops are separate (lines 2-16).
I had a previous version of the pattern using the /m flag (multiline), but it failed too, as it only matched the deepest-level loop (lines 6-8).
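The two failure modes described above can be reproduced outside PCRE. Below is a minimal sketch using Python's built-in re module, assuming that a lazy .*? with re.S behaves closely enough to .* under the /sU flags for this purpose. The names OPEN, CLOSE and SAMPLE are only for illustration, and the opening-tag pattern is a shortened form of the one in the question.

```python
import re

# The nested example data from the question, without the line-number prefixes.
SAMPLE = """<ul>
{{ loop #users as #u }}
<li>
{{ #u.first_name }} {{ #u.last_name }}
<ul>
{{ loop #u.friends as #f }}
<li>{{ #f.first_name }} {{ #f.last_name }}</li>
{{ endloop }}
</ul>
</li>
{{ endloop }}
</ul>

{{ loop #foo as #bar }}
<a href="#">{{ #bar }}</a>
{{ endloop }}
"""

# Shortened opening and closing tag patterns (illustrative names).
OPEN = r"\{\{\s+loop\s+#\w+(?:\.\w+)*\s+as\s+#\w+\s+\}\}"
CLOSE = r"\{\{\s+endloop\s+\}\}"

# Lazy content: the match ends at the first endloop, cutting the outer loop short.
lazy = re.search(OPEN + r"(.*?)" + CLOSE, SAMPLE, re.S)
# Greedy content: the match runs to the last endloop, swallowing the second loop.
greedy = re.search(OPEN + r"(.*)" + CLOSE, SAMPLE, re.S)

lazy_endloops = lazy.group(0).count("endloop")      # 1
greedy_endloops = greedy.group(0).count("endloop")  # 3
print(lazy_endloops, greedy_endloops)
```

Neither quantifier can pair an opening tag with its own closing tag, because a plain .* has no notion of nesting depth.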
I made many attempts (mostly on regexr.com) but could not make any progress. I searched for a solution involving "recursive patterns"; the best I found was this question, but after many attempts I could not adapt it to my project.
I also came across (?R) but haven't managed to use it. Would it be helpful in my case?
I am not only looking for the solution; I would really appreciate understanding how to solve this. Link to current RegexR: regexr.com/426fd
Upvotes: 0
Views: 343
Reputation: 48711
Here is a performance-oriented fix for your problem (it takes a few hundred steps instead of thousands of backtracking ones):
{{\s+loop\s+#(\w+)[^#]*#(\w+)\s*}}(?:[^{]*+|(?R)|{+)*{{\s+endloop\s+}}
See live demo here
RegExp breakdown:
{{\s+loop\s+#(\w+)[^#]*#(\w+)\s*}} - Match a starting loop structure and capture hashed words
(?: - Start of non-capturing group
[^{]*+ - Match anything but a { possessively
| - Or
(?R) - Recurse into the whole pattern
| - Or
{+ - Match one or more opening braces
)* - Match as much as possible
{{\s+endloop\s+}} - Match an ending structure
Upvotes: 2
Reputation: 626738
Here is a quick fix of your current pattern:
{{\s+loop\s+#([a-zA-Z_]\w*)((?:\.\w+)*)\s+as\s+#([a-zA-Z_]\w*)\s*}}((?:(?!{{\s+(?:end)?loop\s).|(?R))*){{\s+endloop\s+}}
Note that you do not need the U modifier for this pattern to run as expected, but you still need the s modifier for . to match any char.
See the regex demo
The main difference is the replacement of .* with (?:(?!{{\s+(?:end)?loop\s).|(?R))*. It matches 0 or more repetitions of:
(?!{{\s+(?:end)?loop\s). - any char (.) that is not starting a sequence meeting the following pattern:
  {{ - a {{ substring
  \s+ - 1+ whitespaces
  (?:end)? - an optional end substring
  loop - a loop substring
  \s - a whitespace
| - or
(?R) - the whole regex pattern
Besides, [a-zA-Z0-9_] is equal to \w if you are not using the u modifier or the (*UCP) PCRE verb, hence the whole pattern can be shortened a bit.
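If the engine at hand has no (?R) (Python's built-in re module, for instance, does not support it), the same first-level extraction can be approximated with a depth counter. This is a different technique from the recursive PCRE pattern above, shown only as a rough sketch; the names TAG and top_level_loops are hypothetical and not part of any library.

```python
import re

# Matches either an opening loop tag (capturing variable and alias)
# or a closing endloop tag.
TAG = re.compile(
    r"\{\{\s+loop\s+#(?P<var>[a-zA-Z_]\w*(?:\.\w+)*)"
    r"\s+as\s+#(?P<alias>[a-zA-Z_]\w*)\s*\}\}"
    r"|\{\{\s+endloop\s+\}\}"
)

def top_level_loops(text):
    """Return (variable, alias, content) for each outermost loop in text."""
    results = []
    depth = 0
    var = alias = None
    start = 0
    for m in TAG.finditer(text):
        if m.group("var") is not None:
            # Opening tag: remember where the outermost loop's content begins.
            if depth == 0:
                var, alias, start = m.group("var"), m.group("alias"), m.end()
            depth += 1
        elif depth > 0:
            # Closing tag: only the one returning to depth 0 ends a top-level loop.
            depth -= 1
            if depth == 0:
                results.append((var, alias, text[start:m.start()]))
    return results

sample = (
    "{{ loop #users as #u }}A"
    "{{ loop #u.friends as #f }}B{{ endloop }}"
    "C{{ endloop }}"
    "{{ loop #foo as #bar }}D{{ endloop }}"
)
loops = top_level_loops(sample)
print(loops)
```

Each returned content string still contains its inner loops untouched, so it can be passed back to top_level_loops for the recursive parsing step the question describes.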
Upvotes: 1