Reputation: 3756
I've got a regex:
~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)(:(?P<function>[a-z0-9\s_-]+)([\s]?\((?P<params>[^)]*)\))?)?})(?P<contents>[^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*)(?P<closing>{/block:(?P=name)})~is
It attempts to match the following:
<ul>{block:menu}
<li><a href="{var:link}">{var:title}</a>
{/block:menu}</ul>
This works fine. However, when a third part is added to the block tag, e.g. {block:menu:thirdbit}, it fails to match. If I chop off the end of the regex and trim it down to just the following, it does match, which implies the pattern is OK but something else has gone wrong:
(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)(:(?P<function>[a-z0-9\s_-]+)([\s]?\((?P<params>[^)]*)\))?)?})
Any ideas what's going wrong?
Upvotes: 0
Views: 96
Reputation: 34395
First, as Tim correctly pointed out, it is unwise to parse HTML with regex.
Second, as presented, the regex in the question is unreadable. I've taken the liberty of reformatting it. Here is a working script which includes a commented, readable version of the exact same regex:
<?php // test.php Rev:20120830_1300
$re = '%
    # Match a non-nested "{block:name:func(params)}...{/block:name}" structure.
    (?P<opening>                        # $1: == $opening: BLOCK start tag.
      {                                 # BLOCK tag opening literal "{"
      (?P<inverse>[!])?                 # $2: == $inverse: Optional "!" negation.
      block:                            # Opening BLOCK tag ident.
      (?P<name>[a-z0-9\s_-]+)           # $3: == $name: BLOCK element name.
      (                                 # $4: Optional BLOCK function.
        :                               # Function name preceded with ":".
        (?P<function>[a-z0-9\s_-]+)     # $5: == $function: Function name.
        (                               # $6: Optional function parameters.
          [\s]?                         # Allow one whitespace before (params).
          \(                            # Literal "(" params opening char.
          (?P<params>[^)]*)             # $7: == $params: function parameters.
          \)                            # Literal ")" params closing char.
        )?                              # End $6: Optional function parameters.
      )?                                # End $4: Optional BLOCK function.
      }                                 # BLOCK tag closing literal "}"
    )                                   # End $1: == $opening: BLOCK start tag.
    (?P<contents>                       # $8: == $contents: BLOCK element contents.
      [^{]*                             # {normal} Zero or more non-"{"
      (?:                               # Begin {(special normal*)*} construct.
        \{                              # {special} Allow a "{" but only if it is
        (?!/?block:[a-z0-9\s_-]+\})     # not a BLOCK tag opening literal "{".
        [^{]*                           # More {normal}
      )*                                # Finish "Unrolling-the-Loop" (See: MRE3).
    )                                   # End $8: == $contents: BLOCK element contents.
    (?P<closing>                        # $9: == $closing: BLOCK element end tag.
      {                                 # BLOCK tag opening literal "{"
      /block:                           # Closing BLOCK tag ident.
      (?P=name)                         # Close name must match open name.
      }                                 # BLOCK tag closing literal "}"
    )                                   # End $9: == $closing: BLOCK element end tag.
    %six';
$text = file_get_contents('testdata.html');
if (preg_match($re, $text, $matches)) print_r($matches);
else echo("no match!");
?>
Note that the additional indentation and comments allow one to actually understand what the regex is attempting to do. My testing shows that there is nothing wrong with the regex and it works as advertised. It even implements Jeffrey Friedl's advanced "Unrolling-the-Loop" efficiency technique, so whoever wrote this has some real regex skills.
For example, given the following data taken from the original question:
<ul>{block:menu}
<li><a href="{var:link}">{var:title}</a>
{/block:menu}</ul>
Here is the (correct) output from the script:
'''
Array
(
[0] => {block:menu}
<li><a href="{var:link}">{var:title}</a>
{/block:menu}
[opening] => {block:menu}
[1] => {block:menu}
[inverse] =>
[2] =>
[name] => menu
[3] => menu
[4] =>
[function] =>
[5] =>
[6] =>
[params] =>
[7] =>
[contents] =>
<li><a href="{var:link}">{var:title}</a>
[8] =>
<li><a href="{var:link}">{var:title}</a>
[closing] => {/block:menu}
[9] => {/block:menu}
)
'''
It also works when the optional function and params are included in the test data.
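As a quick check of that claim, here is a minimal sketch that runs the pattern exactly as posted in the question against two inputs with a function part. The two sample strings are made up for illustration (the question's snippet with ":thirdbit" and ":thirdbit(a, b)" added); they are not the asker's real data.

```php
<?php
// The pattern exactly as posted in the question. It is single-quoted,
// so the backslashes reach PCRE unchanged.
$re = '~(?P<opening>{(?P<inverse>[!])?block:(?P<name>[a-z0-9\s_-]+)(:(?P<function>[a-z0-9\s_-]+)([\s]?\((?P<params>[^)]*)\))?)?})(?P<contents>[^{]*(?:\{(?!/?block:[a-z0-9\s_-]+\})[^{]*)*)(?P<closing>{/block:(?P=name)})~is';

// Hypothetical inputs: one with a function only, one with function and params.
$samples = [
    "<ul>{block:menu:thirdbit}\n<li><a href=\"{var:link}\">{var:title}</a>\n{/block:menu}</ul>",
    "<ul>{block:menu:thirdbit(a, b)}\n<li>{var:title}</li>\n{/block:menu}</ul>",
];

// Collect the named groups of interest for each input (null means no match).
$results = [];
foreach ($samples as $text) {
    if (preg_match($re, $text, $m)) {
        $results[] = [
            'name'     => $m['name'],
            'function' => $m['function'],
            'params'   => $m['params'] ?? '',
        ];
    } else {
        $results[] = null;
    }
}
print_r($results);
```

Both inputs match here: name is "menu" and function is "thirdbit" in each case, and params is "a, b" for the second input. Note that the closing tag in these samples is {/block:menu}, because the pattern requires the closing name to equal the captured name only.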
That said, there are a few issues I have with the question/regex:
{ and } are metacharacters and should be escaped (although PCRE is able to correctly determine that they should be interpreted literally in this case).
Upvotes: 1
Reputation: 197777
Just an idea: convert all {block:menu} and similar elements into XML elements in their own namespace. You can then use XPath and the job is done. You should even be able to do that on the fly.
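A rough sketch of how that could look, under several assumptions: the namespace URI "urn:example:tpl" and the "tpl" prefix are invented for this example, the template is well-formed XML once the tags are rewritten (so the sample closes its li element, unlike the snippet in the question), and only plain {block:name} tags are handled, not the "!" or ":function(params)" forms. It relies on PHP's DOM extension.

```php
<?php
// Hypothetical namespace URI, chosen only for this sketch.
const TPL_NS = 'urn:example:tpl';

// Sample template; <li> is closed so the rewritten text parses as XML.
$template = '<ul>{block:menu}<li><a href="{var:link}">{var:title}</a></li>{/block:menu}</ul>';

// Rewrite opening and closing block tags as namespaced XML elements.
$xml = preg_replace('~\{block:([a-z0-9_-]+)\}~i', '<tpl:block name="$1">', $template);
$xml = preg_replace('~\{/block:[a-z0-9_-]+\}~i', '</tpl:block>', $xml);

// Wrap everything in a root element that declares the namespace prefix.
$doc = new DOMDocument();
$doc->loadXML('<root xmlns:tpl="' . TPL_NS . '">' . $xml . '</root>');

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('tpl', TPL_NS);

// Collect the serialized inner content of every block, keyed by block name.
$blocks = [];
foreach ($xpath->query('//tpl:block[@name]') as $node) {
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveXML($child);
    }
    $blocks[$node->getAttribute('name')] = $inner;
}
print_r($blocks);
```

Nested blocks come for free with this approach, since the XML parser does the tag matching. The catch is that real-world HTML templates are often not well-formed XML, in which case loadXML() will fail.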
Upvotes: 1