Kwaak

Reputation: 415

PHP regex matching recursively

I'm trying to match a certain set of tags in a template file. However, I want the tags to be nestable inside each other.

My regex is the following (with the /s modifier):

<!-- START (.*?) -->(.*?)<!-- END \\1 -->

Tag example:

<!-- START yList -->
  y:{yList:NUM} | 
  <!-- START xList -->
    x:{xList:NUM} 
  <!-- END xList -->
  <!-- CARET xList -->
  <br>
<!-- END yList -->
<!-- CARET yList -->

Right now the matches result will be:

match 0:

group(0) (Whole match)

<!-- START yList --> 
 y 
 <!-- START xList --> 
   x 
 <!-- END xList --> 
 <!-- CARET xList --> 
 <br> 
<!-- END yList -->

group(1)

yList

group(2)

y 
<!-- START xList --> 
  x 
<!-- END xList --> 
<!-- CARET xList --> 
<br>

I want 2 matches instead of 1: the nested tag set isn't matched. Is this possible with a single regex, or should I just keep applying the regex to the group(2) results until I find no new matches?

Upvotes: 0

Views: 226

Answers (2)

Gumbo

Reputation: 655189

You could do something like this:

// Split on the tags; PREG_SPLIT_DELIM_CAPTURE keeps the tags themselves in the result.
$parts = preg_split('/(<!-- (?:START|END|CARET) [a-zA-Z][a-zA-Z0-9]* -->)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
$tokens = array();
// Text and tags alternate in $parts, so only the first part needs to be checked.
$isTag = isset($parts[0]) && preg_match('/^<!-- (?:START|END|CARET) [a-zA-Z][a-zA-Z0-9]* -->$/', $parts[0]);
foreach ($parts as $part) {
    if ($isTag) {
        // Tag token: stored as array(type, name)
        preg_match('/^<!-- (START|END|CARET) ([a-zA-Z][a-zA-Z0-9]*) -->$/', $part, $match);
        $tokens[] = array($match[1], $match[2]);
    } else {
        // Text token: keep non-empty text as-is
        if ($part !== '') $tokens[] = $part;
    }
    $isTag = !$isTag;
}
var_dump($tokens);

That will give you a flat list of tokens that reflects the structure of your template.
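To see the nesting in such a token list, one option is to track a depth counter while walking it. The following is only a sketch: outlineTokens is a hypothetical helper, not part of the answer above, and it assumes tag tokens are array(type, name) pairs and text tokens are plain strings, as the splitting code is meant to produce. The sample $tokens array is written by hand from the template in the question.

```php
<?php
// Hypothetical helper: renders a flat token list as an indented outline.
// Assumes tag tokens are array(type, name) and text tokens are strings.
function outlineTokens(array $tokens): array
{
    $lines = array();
    $depth = 0;
    foreach ($tokens as $token) {
        if (is_array($token)) {
            list($type, $name) = $token;
            if ($type === 'END') {
                // An END tag sits at the same level as its START tag.
                $depth = max(0, $depth - 1);
            }
            $lines[] = str_repeat('  ', $depth) . $type . ' ' . $name;
            if ($type === 'START') {
                // Everything up to the matching END is nested one level deeper.
                $depth++;
            }
        } elseif (trim($token) !== '') {
            // Whitespace-only text is skipped to keep the outline short.
            $lines[] = str_repeat('  ', $depth) . 'TEXT ' . trim($token);
        }
    }
    return $lines;
}

// Hand-written sample, modelled on the template in the question.
$tokens = array(
    array('START', 'yList'),
    "\n  y:{yList:NUM} | \n  ",
    array('START', 'xList'),
    "\n    x:{xList:NUM} \n  ",
    array('END', 'xList'),
    "\n  ",
    array('CARET', 'xList'),
    "\n  <br>\n",
    array('END', 'yList'),
);
echo implode("\n", outlineTokens($tokens)), "\n";
```

The depth counter does not verify that each END name matches the most recent START; it only shows how the flat list maps to nesting levels.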

Upvotes: 0

Fragsworth

Reputation: 35497

Regular expressions are not well suited to parsing tree structures of arbitrary depth. It may be possible, depending on the regex flavor you are using (PCRE, which PHP's preg functions use, supports recursive patterns), but it is not recommended: such patterns are difficult to read and difficult to debug.

I would suggest writing a simple parser instead. What you do is decompose your text into a set of possible tokens which can each be defined by simple regular expressions, e.g.:

START_TOKEN = "<!-- START [A-Za-z]+ -->"
END_TOKEN = ...
HTML_TEXT = ...

Iterate over your string, and as long as you match these tokens, pull them out of the string, and store them in a separate list. Be sure to save the text that was inside the token (if any) when you do this.
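The tokenizing step described above could be sketched in PHP roughly as follows. This is a minimal sketch under assumptions: the function name tokenize, the 'type'/'name'/'text' array keys, and the tag-name pattern are choices made for illustration, and the START/END/CARET tag types are taken from the question.

```php
<?php
// Hypothetical tokenizer: splits a template into tag tokens and TEXT tokens.
function tokenize(string $template): array
{
    // Assumed tag syntax, based on the examples in the question.
    $tagPattern = '/<!-- (START|END|CARET) ([A-Za-z][A-Za-z0-9]*) -->/';
    $tokens = array();
    $pos = 0; // byte offset of the first character not yet consumed

    // With PREG_OFFSET_CAPTURE every group is array(matchedText, byteOffset).
    preg_match_all($tagPattern, $template, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    foreach ($matches as $m) {
        list($tag, $offset) = $m[0];
        if ($offset > $pos) {
            // Save the text that sits between the previous tag and this one.
            $tokens[] = array('type' => 'TEXT', 'text' => substr($template, $pos, $offset - $pos));
        }
        $tokens[] = array('type' => $m[1][0], 'name' => $m[2][0]);
        $pos = $offset + strlen($tag);
    }
    if ($pos < strlen($template)) {
        // Trailing text after the last tag.
        $tokens[] = array('type' => 'TEXT', 'text' => substr($template, $pos));
    }
    return $tokens;
}

$tokens = tokenize("a<!-- START x -->b<!-- END x -->");
var_dump($tokens);
```

Anything that does not match the tag pattern, including malformed tags, ends up in a TEXT token; whether that is the right behaviour depends on the template format.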

Then you can iterate over your list of tokens and, based on the token types, build a nested tree of nodes, each containing either 1) the text of the original token or 2) a list of child nodes.
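One way to build that tree is a small recursive function that consumes tokens until it reaches the END that closes the current START. The sketch below assumes tokens shaped as arrays with a 'type' key plus 'name' or 'text'; that shape, the function name parseNodes, and the hand-written sample list are all assumptions made for illustration.

```php
<?php
// Hypothetical tree builder. $i is the read position in $tokens and is
// shared across recursive calls; $open is the name of the START tag whose
// children are currently being collected (null at the top level).
function parseNodes(array $tokens, int &$i, ?string $open = null): array
{
    $nodes = array();
    while ($i < count($tokens)) {
        $token = $tokens[$i++];
        if ($token['type'] === 'START') {
            // A START tag opens a node; its children are parsed recursively.
            $nodes[] = array(
                'name' => $token['name'],
                'children' => parseNodes($tokens, $i, $token['name']),
            );
        } elseif ($token['type'] === 'END') {
            if ($token['name'] !== $open) {
                throw new UnexpectedValueException('Unexpected END ' . $token['name']);
            }
            return $nodes; // the current node is complete
        } else {
            // Any other token (text, CARET, ...) is stored as a leaf.
            $nodes[] = $token;
        }
    }
    if ($open !== null) {
        throw new UnexpectedValueException('Missing END ' . $open);
    }
    return $nodes;
}

// Hand-written sample token list.
$tokens = array(
    array('type' => 'START', 'name' => 'yList'),
    array('type' => 'TEXT', 'text' => 'y'),
    array('type' => 'START', 'name' => 'xList'),
    array('type' => 'TEXT', 'text' => 'x'),
    array('type' => 'END', 'name' => 'xList'),
    array('type' => 'END', 'name' => 'yList'),
);
$i = 0;
$tree = parseNodes($tokens, $i);
print_r($tree);
```

Because each recursive call knows which tag it is waiting for, mismatched or missing END tags are reported instead of silently producing a wrong tree.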

You may want to look at some parser tutorials if this seems too complicated.

Upvotes: 5
