MarcoS

Reputation: 17721

PHP regexp: how to cut nested patterns?

I have a rather silly problem that has been stumping me for a while...
I want to parse some text, formatted this way:

CUT-FROM-A ...
CUT-FROM-B ...
CUT-TO ...
CUT-TO
apple
CUT-FROM-C ...
CUT-TO
orange

In this example, I would like to extract the 'fruits', ignoring everything from each CUT-FROM-X to the corresponding CUT-TO. By 'corresponding' I mean "from inside to outside". If it's clearer, try mentally substituting every CUT-FROM-X with an opening bracket and every CUT-TO with a closing bracket: I want to ignore the content inside the brackets, including the brackets themselves.
I hope this is clear, but I'm afraid it's not... :-(
I suppose the main difficulty here is that the 'closing brackets' all have the same signature, so they can't easily be associated with their corresponding opener...

I have tried something like this (not greedy):

$output_text = preg_replace("/CUT-FROM-.*?TO/s", "", $input_text);

but this leaves the second CUT-TO in the output...

And something like this (greedy):

$output_text = preg_replace("/CUT-FROM-.*TO/s", "", $input_text);

but this eats the first 'fruit'... :-(

This is my testing on regex101.

Can anybody shed some light on this?

Upvotes: 0

Views: 97

Answers (3)

hwnd

Reputation: 70750

Just a thought, you could process each line matching the context you want instead of replacing.

preg_match_all('~^(?!.*CUT-(?:FROM|TO)).+$~mi', $text, $matches);
var_dump($matches[0]);

Output

array(2) {
  [0]=> string(5) "apple"
  [1]=> string(6) "orange"
}
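
One caveat worth checking against your own data: this pattern only drops lines that themselves contain a marker, so it relies on every line inside a cut block carrying CUT-FROM or CUT-TO, as in the question's sample. A minimal sketch of the difference follows; the helper name linesWithoutMarkers and the second input are made up for illustration.

```php
<?php
// Hypothetical helper wrapping the preg_match_all call shown above.
function linesWithoutMarkers(string $text): array
{
    preg_match_all('~^(?!.*CUT-(?:FROM|TO)).+$~mi', $text, $matches);
    return $matches[0];
}

// The sample from the question: every non-fruit line carries a marker.
$flat = implode("\n", [
    'CUT-FROM-A ...', 'CUT-FROM-B ...', 'CUT-TO ...', 'CUT-TO',
    'apple', 'CUT-FROM-C ...', 'CUT-TO', 'orange',
]);

// Illustrative input (an assumption): an ordinary line inside a block.
$nested = implode("\n", [
    'CUT-FROM-A ...', 'ignore this,', 'CUT-TO', 'apple',
]);

$flatResult   = linesWithoutMarkers($flat);   // only the fruits
$nestedResult = linesWithoutMarkers($nested); // 'ignore this,' is kept too
```

If ordinary lines can appear between the markers, the line filter alone is not enough and the nesting has to be tracked.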

Upvotes: 1

axiac

Reputation: 72376

You can do this with a single regex, but you can do it better with a simple script that uses small regexes for smaller tasks.

The idea: parse the text line by line and use a regex to identify the type of each line. On every 'CUT-FROM' line, push some information (the line itself or something else) onto a stack (using array_push()). On every 'CUT-TO' line, remove the top element from the stack (using array_pop()).

Process the other lines as you need. For example, if you need to ignore the lines between a 'CUT-FROM' and the corresponding 'CUT-TO' line, check whether the stack is non-empty to know that you are inside a pair. If the stack is empty, then every 'CUT-FROM' seen so far has been paired with a 'CUT-TO' line and you are parsing lines outside of any enclosure.

This approach also gives you a nice way to detect and handle (ignore/fix/report/whatever) errors in the input text.

Sample program:

$text = <<< END_TEXT
CUT-FROM-A ...
ignore this,
CUT-FROM-B ...
this,
CUT-TO ...
and this
CUT-TO
apple
CUT-FROM-C ...
CUT-TO
orange
END_TEXT;

$lines = explode("\n", $text);


$stack = array();
foreach ($lines as $i => $line) {
    // Check if it's a 'CUT-FROM-' line
    if (preg_match('/^CUT-FROM-/', $line)) {
        array_push($stack, $line);
        continue;
    }

    // Check if it's a 'CUT-TO' line
    if (preg_match('/^CUT-TO/', $line)) {
        if (array_pop($stack) === NULL) {
            // an unpaired 'CUT-TO' was found
            echo("An unpaired 'CUT-TO' was found on line ".($i + 1).". Will ignore it.\n");
        }
        continue;
    }


    // A regular line
    if (count($stack) > 0) {
        // inside a (CUT-FROM, CUT-TO) pair
        // count($stack) tells how many pairs are around this item

        // ignore it

    } else {
        // outside any pair
        echo ($line."\n");
    }
}

// Check if all the 'CUT-FROM' lines were closed
if (count($stack) > 0) {
    echo('Found that '.count($stack)." 'CUT-TO' lines are missing at the end of processing.\n");
}
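
The same idea can be packaged as a function that returns the kept lines instead of echoing them. This is only a sketch: the name linesOutsideCuts and the $warnings parameter are made up here. Since the sample program never reads what it pushed onto the stack, a plain depth counter stands in for it.

```php
<?php
// Hypothetical helper: returns the lines that sit outside every CUT block.
// $warnings collects messages about unbalanced markers.
function linesOutsideCuts(string $text, array &$warnings = []): array
{
    $kept  = [];
    $depth = 0; // number of currently open CUT-FROM blocks

    foreach (preg_split('/\R/', $text) as $i => $line) {
        if (strpos($line, 'CUT-FROM-') === 0) {
            $depth++;
            continue;
        }
        if (strpos($line, 'CUT-TO') === 0) {
            if ($depth === 0) {
                $warnings[] = "Unpaired 'CUT-TO' on line " . ($i + 1);
            } else {
                $depth--;
            }
            continue;
        }
        if ($depth === 0) {
            $kept[] = $line; // a regular line outside any pair
        }
    }

    if ($depth > 0) {
        $warnings[] = "$depth 'CUT-TO' line(s) missing at end of input";
    }
    return $kept;
}

$sample = implode("\n", [
    'CUT-FROM-A ...', 'ignore this,', 'CUT-FROM-B ...', 'this,',
    'CUT-TO ...', 'and this', 'CUT-TO', 'apple',
    'CUT-FROM-C ...', 'CUT-TO', 'orange',
]);

$warnings = [];
$fruits   = linesOutsideCuts($sample, $warnings);

// An unbalanced input, to exercise the error reporting.
$badWarnings = [];
$badKept     = linesOutsideCuts("CUT-TO\npear", $badWarnings);
```

Keep the real stack if you want to report which 'CUT-FROM' line was left open; the counter only tells you how many.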

Upvotes: 0

Lucas Trzesniewski

Reputation: 51430

Since you're asking for a regex solution, a readable recursive regex would be:

(?(DEFINE)
  (?<cut>
    ^CUT-FROM-
    (?&content)*?
    ^CUT-TO
  )

  (?<content>
    (?: (?!CUT-(?:FROM-|TO)) . )++
    | (?&cut)
  )
)

(?&cut)

Use it with the s, m and x options. This matches everything you want to ignore, so you can replace it with an empty string. The syntax (?&something) means "recurse into the group named something"; it's the same as \g<something>.
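
A minimal sketch of how this could be wired into preg_replace, using the sample from the question. The variable names are arbitrary, and the final split into non-empty lines is an extra step assumed here, because the removed blocks leave their line breaks behind.

```php
<?php
// The (?(DEFINE) ...) pattern from above; the x flag lets it span lines.
$pattern = '~
(?(DEFINE)
  (?<cut>
    ^CUT-FROM-
    (?&content)*?
    ^CUT-TO
  )
  (?<content>
    (?: (?!CUT-(?:FROM-|TO)) . )++
    | (?&cut)
  )
)
(?&cut)
~smx';

// Sample input from the question, joined with "\n" line endings.
$input = implode("\n", [
    'CUT-FROM-A ...', 'CUT-FROM-B ...', 'CUT-TO ...', 'CUT-TO',
    'apple', 'CUT-FROM-C ...', 'CUT-TO', 'orange',
]);

// Strip every outermost CUT-FROM ... CUT-TO block.
$cleaned = preg_replace($pattern, '', $input);

// Split on line breaks and drop the empty lines left behind.
$fruits = preg_split('/\R/', $cleaned, -1, PREG_SPLIT_NO_EMPTY);
```

Note that the match ends right after the outermost CUT-TO, so any text following it on the same line stays in the output.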

And here's a more compact version that does essentially the same thing:

^CUT-FROM-
(?:(?:(?!CUT-(?:FROM-|TO)) . )++ | (?R))*?
^CUT-TO

In this version, (?R) means recurse the whole pattern. It still uses the smx options. The one-liner version (without x) would be:

(?sm)^CUT-FROM-(?:(?:(?!CUT-(?:FROM-|TO)).)++|(?R))*?^CUT-TO

But I advise against doing such things. Prefer the version with the (?(DEFINE) ... ) for readability.
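
To see what the pattern actually captures, here is a small sketch that lists the matched blocks with preg_match_all instead of removing them. It uses the one-liner only because it fits in a single string; the input is the sample from the question and the variable names are arbitrary.

```php
<?php
// One-liner version; (?R) recurses into the whole pattern.
$pattern = '~(?sm)^CUT-FROM-(?:(?:(?!CUT-(?:FROM-|TO)).)++|(?R))*?^CUT-TO~';

// Sample input from the question, joined with "\n" line endings.
$input = implode("\n", [
    'CUT-FROM-A ...', 'CUT-FROM-B ...', 'CUT-TO ...', 'CUT-TO',
    'apple', 'CUT-FROM-C ...', 'CUT-TO', 'orange',
]);

preg_match_all($pattern, $input, $matches);

// Each entry is one outermost block, nested blocks included.
$blocks = $matches[0];
```

On this input there are two outermost blocks: the nested A/B block is matched as a single unit, and the C block is matched on its own.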

Upvotes: 3
