wong2

Reputation: 35720

How to match some nested structure with regex?

For example, I have a string like this:

{% a %}
    {% b %}
    {% end %}
{% end %}

I want to get the content between {% a %} and its matching {% end %}, which is {% b %} {% end %}.
I have been using {% \S+ %}(.*){% end %} for this, but it breaks when I add a second block, c, after the first:

 {% a %}
        {% b %}
        {% end %}
    {% end %}
{% c %}
{% end %}

It no longer works. How can I do this with a regular expression?

Upvotes: 3

Views: 241

Answers (3)

ridgerunner

Reputation: 34395

Given this test data:

$text = '
{% a %}
    {% b %}
        {% a %}
        {% end %}
    {% end %}
        {% b %}
        {% end %}
{% end %}
{% c %}
{% end %}
';

This tested script does the trick:

<?php
$re = '/
    # Match nested {% a %}{% b %}...{% end %}{% end %} structures.
    \{%[ ]\w[ ]%\}       # Opening delimiter.
    (?:                  # Group for contents alternatives.
      (?R)               # Either a nested recursive component,
    |                    # or non-recursive component stuff.
      [^{]*+             # {normal*} Zero or more non-{
      (?:                # Begin: "unrolling-the-loop"
        \{               # {special} Allow a { as long
        (?!              # as it is not the start of
          %[ ]\w[ ]%\}   # a new nested component, or
        | %[ ]end[ ]%\}  # the end of this component.
        )                # Ok to match { followed by
        [^{]*+           # more {normal*}. (See: MRE3!)
      )*+                # End {(special normal*)*} construct.
    )*+                  # Zero or more contents alternatives
    \{%[ ]end[ ]%\}      # Closing delimiter.
    /ix';
$count = preg_match_all($re, $text, $m);
if ($count) {
    printf("%d Matches:\n", $count);
    for ($i = 0; $i < $count; ++$i) {
        printf("\nMatch %d:\n%s\n", $i + 1, $m[0][$i]);
    }
}
?>

Here is the output:

2 Matches:

Match 1:
{% a %}
    {% b %}
        {% a %}
        {% end %}
    {% end %}
        {% b %}
        {% end %}
{% end %}

Match 2:
{% c %}
{% end %}

Edit: If you need to match an opening tag whose name has more than one word character, replace both occurrences of the \w token with (?!end)\w++ (as is correctly implemented in tchrist's excellent answer).
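As a rough sketch of that edit, here is the same pattern with both \w tokens swapped for (?!end)\w++. The tag names outer, inner and other are made-up sample data, and the extra capture group 1 around the contents is my own addition (not part of the tested script above) so that the text between the opening tag and its matching {% end %} can be read directly:

```php
<?php
// Hypothetical sample input; the tag names are made up for illustration.
$text = '
{% outer %}
    {% inner %}
    {% end %}
{% end %}
{% other %}
{% end %}
';

$re = '/
    \{%[ ](?!end)\w++[ ]%\}        # Opening tag: any name except "end".
    (                              # Group 1 (added here): the contents.
      (?:
        (?R)                       # Either a nested component,
      |                            # or text that is not a tag.
        [^{]*+
        (?:
          \{
          (?!
            %[ ](?!end)\w++[ ]%\}  # Not an opening tag,
          | %[ ]end[ ]%\}          # and not a closing tag.
          )
          [^{]*+
        )*+
      )*+
    )
    \{%[ ]end[ ]%\}                # Closing tag.
    /ix';

$count = preg_match_all($re, $text, $m);
for ($i = 0; $i < $count; ++$i) {
    // $m[1] holds group 1 of each top-level match.
    printf("Contents %d: [%s]\n", $i + 1, trim($m[1][$i]));
}
```

PCRE restores capture groups when a recursive call returns, so group 1 should end up holding the contents of each top-level block rather than those of the innermost one. Note that (?!end) also rejects any tag name that merely starts with "end".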

Upvotes: 4

tchrist

Reputation: 80384

Here is a demo in Perl of an approach that works for your dataset. The same should work in PHP.

#!/usr/bin/env perl

use strict;
use warnings;

my $string = <<'EO_STRING';
    {% a %}
            {% b %}
            {% end %}
        {% end %}
    {% c %}
    {% end %}
EO_STRING


print "MATCH: $&\n" while $string =~ m{
    \{ % \s+ (?!end) \w+ \s+ % \}
    (?: (?: (?! \{ % | % \} ) . ) | (?R) )*
    \{ % \s+ end \s+ % \}
}xsg;

When run, that produces this:

MATCH: {% a %}
            {% b %}
            {% end %}
        {% end %}
MATCH: {% c %}
    {% end %}

There are several other ways to write that. You may have other constraints that you haven’t shown, but this should get you started.
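For what it's worth, a PHP port might look roughly like the following. This is a sketch, not something taken from the Perl demo verbatim: the inner lookahead is written here as (?! \{ % | % \} ), which is my assumption about the intent, so that ordinary characters never step into a tag. The variable names are arbitrary.

```php
<?php
// Same sample data as the Perl demo, as a nowdoc (no interpolation).
$string = <<<'EO_STRING'
    {% a %}
            {% b %}
            {% end %}
        {% end %}
    {% c %}
    {% end %}
EO_STRING;

// x = ignore whitespace in the pattern, s = let . match newlines.
$re = '/
    \{ % \s+ (?!end) \w+ \s+ % \}            # opening tag, any name but "end"
    (?: (?: (?! \{ % | % \} ) . ) | (?R) )*  # plain characters or a nested block
    \{ % \s+ end \s+ % \}                    # closing tag
/xs';

$count = preg_match_all($re, $string, $m);
foreach ($m[0] as $match) {
    print "MATCH: $match\n";
}
```

preg_match_all() stands in for the Perl while loop with the g modifier; $m[0] collects the full text of every top-level match, which should be the {% a %} block and the {% c %} block for this input.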

Upvotes: 2

Mr. Llama

Reputation: 20899

What you're looking for is called a recursive regex. PHP's PCRE engine supports it through the (?R) construct, which re-enters the whole pattern at that point.

I'm not familiar enough with it to be able to help you with the pattern itself, but hopefully this is a push in the right direction.
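To give a feel for (?R) on its own, here is a minimal sketch that uses balanced parentheses rather than the tags from the question. The input string is made up for illustration.

```php
<?php
// (?R) re-enters the whole pattern, so "(" ... ")" pairs can nest to any depth.
$re = '/\( (?: [^()]++ | (?R) )* \)/x';

// Hypothetical sample input.
$input = 'f(a, (b, (c)), d) and (e)';

preg_match_all($re, $input, $m);
print_r($m[0]);   // the two outermost parenthesised groups
```

Each match starts at an opening parenthesis and ends at the one that balances it, so the nested groups come back as part of their outermost group instead of as separate matches.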

Upvotes: 0
