Cholthi Paul Ttiopic
Cholthi Paul Ttiopic

Reputation: 936

PHP regex pattern match recursive

I have this in a function which is supposed to replace any sequence of parentheses with what is enclosed in it like (abc) becomes abc any where it appears even recursively because parens can be nested.

$return =   preg_replace_callback(
    '|(\((.+)\))+|',
    function ($matches) {
        return $matches[2];
    },
    $s
);

when the above regex is fed this string "a(bcdefghijkl(mno)p)q" as input it returns "ap)onm(lkjihgfedcbq". This shows the regex is matched once. What can I do to make it continue to match even inside already made matches and produce this `abcdefghijklmnopq'"

Upvotes: 2

Views: 825

Answers (3)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626728

To match balanced parenthetical substrings you may use a well-known \((?:[^()]++|(?R))*\) pattern (described in Matching Balanced Constructs), inside a preg_replace_callback method, where the match value can be further manipulated (just remove all ( and ) symbols from the match that is easy to do even without a regex:

$re = '/\((?:[^()]++|(?R))*\)/';
$str = 'a(bcdefghijkl(mno)p)q((('; // Added three ( at the end
$result = preg_replace_callback($re, function($m) {
    return str_replace(array('(',')'), '', $m[0]);
}, $str);
echo $result; // => abcdefghijklmnopq(((

See the PHP demo

To get overlapping matches, you need to use a known technique, capturing inside a positive lookahead, but you won't be able to perform two operations at once (replacing and matching), you can run matching first, and then replace:

$re = '/(?=(\((?:[^()]++|(?1))*\)))/';
$str = 'a(bcdefghijkl(mno)p)q(((';
preg_match_all($re, $str, $m);
print_r($m[1]);
// => Array ( [0] => (bcdefghijkl(mno)p)  [1] => (mno) )

See the PHP demo.

Upvotes: 2

Roger Gee
Roger Gee

Reputation: 871

It is slightly unclear exactly what the postcondition of the algorithm is supposed to be. It seems to me that you are wanting to strip out matching pairs of ( ). The assumption here is that unmatched parentheses are left alone (otherwise you just strip out all of the ('s and )'s).

So I guess this means the input string a(bcdefghijkl(mno)p)q becomes abcdefghijklmnopq but the input string a(bcdefghijkl(mno)pq becomes a(bcdefghijklmnopq. Likewise an input string (a)) would become a).

It may be possible to do this using pcre since it does provide some non-regular features but I'm doubtful about it. The language of the input strings is not regular; it's context-free. What @ArtisticPhoenix's answer does is match complete pairs of matched parentheses. What it does not do is match all nested pairs. This nested matching is inherently non-regular in my humble understanding of language theory.

I suggest writing a parser to strip out the matching pairs of parentheses. It gets a little wordy having to account for productions that fail to match:

<?php

// Parse the punctuator sub-expression (i.e. anything within ( ... ) ).
function parse_punc(array $tokens,&$iter) {
    if (!isset($tokens[$iter])) {
        return;
    }

    $inner = parse_punc_seq($tokens,$iter);
    if (!isset($tokens[$iter]) || $tokens[$iter] != ')') {
        // Leave unmatched open parentheses alone.
        $inner = "($inner";
    }

    $iter += 1;
    return $inner;
}

// Parse a sequence (inside punctuators).
function parse_punc_seq(array $tokens,&$iter) {
    if (!isset($tokens[$iter])) {
        return;
    }

    $tok = $tokens[$iter];
    if ($tok == ')') {
        return;
    }
    $iter += 1;

    if ($tok == '(') {
        $tok = parse_punc($tokens,$iter);
    }

    $tok .= parse_punc_seq($tokens,$iter);
    return $tok;
}

// Parse a sequence (outside punctuators).
function parse_seq(array $tokens,&$iter) {
    if (!isset($tokens[$iter])) {
        return;
    }

    $tok = $tokens[$iter++];
    if ($tok == '(') {
        $tok = parse_punc($tokens,$iter);
    }

    $tok .= parse_seq($tokens,$iter);
    return $tok;
}

// Wrapper for parser.
function parse(array $tokens) {
    $iter = 0;
    return strval(parse_seq($tokens,$iter));
}

// Grab input from stdin and run it through the parser.
$str = trim(stream_get_contents(STDIN));
$tokens = preg_split('/([\(\)])/',$str,-1,PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
var_dump(parse($tokens));

I know this is a lot more code than a regex one-liner but it does solve the problem as I understand it. I'd be interested to know if anyone can solve this problem with a regular expression.

Upvotes: 1

ArtisticPhoenix
ArtisticPhoenix

Reputation: 21661

Try this one,

  preg_match('/\((?:[^\(\)]*+|(?0))*\)/', $str )

https://regex101.com/r/NsQSla/1

It will match everything inside of the ( ) as long as they are matched pairs.

Example

(abc) (abc (abc))

will have the following matches

Match 1
   Full match   0-5 `(abc)`
Match 2
   Full match   6-17    `(abc (abc))`

Upvotes: 1

Related Questions