Reputation: 17721
I have a rather silly problem that has been stumping me for a while...
I want to parse some text, formatted this way:
CUT-FROM-A ...
CUT-FROM-B ...
CUT-TO ...
CUT-TO
apple
CUT-FROM-C ...
CUT-TO
orange
In this example, I would like to extract the 'fruits', ignoring everything from each CUT-FROM-X to the corresponding CUT-TO. By 'corresponding' I mean matched "from inside to outside"; or, if it's clearer, try mentally substituting every CUT-FROM-X with an opening bracket and every CUT-TO with a closing bracket: I then want to ignore the content inside the brackets, including the brackets themselves.
I hope this is clear, but I'm afraid it's not... :-(
I suppose the main difficulty here is that the 'closing brackets' all have the same signature, so they can't easily be associated with their matching opener...
I have tried something like this (not greedy):
$output_text = preg_replace("/CUT-FROM-.*?TO/s", "", $input_text);
but this leaves the second CUT-TO in the output...
And something like this (greedy):
$output_text = preg_replace("/CUT-FROM-.*TO/s", "", $input_text);
but this eats the first 'fruit'... :-(
This is my testing on regex101.
Can anybody shed some light on this?
Upvotes: 0
Views: 97
Reputation: 70750
Just a thought, you could process each line matching the context you want instead of replacing.
preg_match_all('~^(?!.*CUT-(?:FROM|TO)).+$~mi', $text, $matches);
var_dump($matches[0]);
Output
array(2) {
[0]=> string(5) "apple"
[1]=> string(6) "orange"
}
Upvotes: 1
Reputation: 72376
You can do this with a single regex, but you can do it better with a simple script that uses small regexes for smaller tasks.
The idea: parse the text line by line and use a regex to identify the line type. On every 'CUT-FROM' line, add information (the line itself or something else) to a stack (using array_push()). On every 'CUT-TO' line, remove the top element from the stack (using array_pop()).
Process other rows as you need. For example, if you need to ignore the lines between a 'CUT-FROM' and the corresponding 'CUT-TO' line you need to check that the stack is not empty to know that you are inside a pair. If the stack is empty then all the 'CUT-FROM' were paired with 'CUT-TO' lines and you are parsing lines outside of any enclosure.
This approach also provides you a nice way to detect and handle (ignore/fix/report/whatever) the errors in the input text.
Sample program:
$text = <<< END_TEXT
CUT-FROM-A ...
ignore this,
CUT-FROM-B ...
this,
CUT-TO ...
and this
CUT-TO
apple
CUT-FROM-C ...
CUT-TO
orange
END_TEXT;
$lines = explode("\n", $text);
$stack = array();
foreach ($lines as $i => $line) {
    // Check if it's a 'CUT-FROM-' line
    if (preg_match('/^CUT-FROM-/', $line)) {
        array_push($stack, $line);
        continue;
    }
    // Check if it's a 'CUT-TO' line
    if (preg_match('/^CUT-TO/', $line)) {
        if (array_pop($stack) === NULL) {
            // an unpaired 'CUT-TO' was found
            echo("An unpaired 'CUT-TO' was found on line ".($i + 1).". Will ignore it.\n");
        }
        continue;
    }
    // A regular line
    if (count($stack) > 0) {
        // inside a (CUT-FROM, CUT-TO) pair
        // count($stack) tells how many pairs are around this item
        // ignore it
    } else {
        // outside any pair
        echo ($line."\n");
    }
}
// Check if all the 'CUT-FROM' lines were closed
if (count($stack) > 0) {
    echo('Found that '.count($stack)." 'CUT-TO' lines are missing at the end of processing.\n");
}
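If you only need the surviving lines and not the opener text, the same idea can be sketched with a plain depth counter instead of a stack. The function name keepLinesOutsideCuts() is a hypothetical helper introduced here for illustration, and silently ignoring an unpaired 'CUT-TO' is an assumption rather than part of the program above:

```php
<?php
// Hypothetical helper (not part of the original answer): returns the lines
// that sit outside every CUT-FROM / CUT-TO pair.
function keepLinesOutsideCuts(string $text): array
{
    $depth = 0;   // number of currently open CUT-FROM blocks
    $kept  = [];

    // \R matches any line-break sequence (\n, \r\n, ...)
    foreach (preg_split('/\R/', $text) as $line) {
        if (preg_match('/^CUT-FROM-/', $line)) {
            $depth++;
            continue;
        }
        if (preg_match('/^CUT-TO/', $line)) {
            // Assumption: an unpaired CUT-TO is silently ignored
            if ($depth > 0) {
                $depth--;
            }
            continue;
        }
        if ($depth === 0) {
            $kept[] = $line;
        }
    }
    return $kept;
}

$sample = "CUT-FROM-A ...\nignore this,\nCUT-FROM-B ...\nthis,\nCUT-TO ...\n"
        . "and this\nCUT-TO\napple\nCUT-FROM-C ...\nCUT-TO\norange";

$fruits = keepLinesOutsideCuts($sample);
print_r($fruits);
```

A counter is enough when nothing about the opening line is needed later; keep the stack if you want to report which 'CUT-FROM' was left unclosed.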
Upvotes: 0
Reputation: 51430
Since you're asking for a regex solution, a readable recursive regex would be:
(?(DEFINE)
(?<cut>
^CUT-FROM-
(?&content)*?
^CUT-TO
)
(?<content>
(?: (?!CUT-(?:FROM-|TO)) . )++
| (?&cut)
)
)
(?&cut)
Use it with the smx options. This matches everything you want to ignore, so you can replace it with an empty string. The syntax (?&something) means "recurse into something"; it's the same as \g<something>.
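For reference, here is a minimal sketch of how this pattern might be plugged into preg_replace(). The ~ delimiter, the variable names, and the final split into non-empty lines are choices made for this illustration, not part of the pattern itself:

```php
<?php
// The pattern above, wrapped in ~ delimiters with the s, m and x flags.
// A nowdoc keeps the layout readable and avoids any escaping.
$pattern = <<<'REGEX'
~
(?(DEFINE)
  (?<cut>
    ^CUT-FROM-
    (?&content)*?
    ^CUT-TO
  )
  (?<content>
    (?: (?!CUT-(?:FROM-|TO)) . )++
    | (?&cut)
  )
)
(?&cut)
~smx
REGEX;

// Sample input from the question
$input = "CUT-FROM-A ...\nCUT-FROM-B ...\nCUT-TO ...\nCUT-TO\napple\n"
       . "CUT-FROM-C ...\nCUT-TO\norange";

$cleaned = preg_replace($pattern, '', $input);
if ($cleaned === null) {
    throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
}

// The replacement leaves empty lines behind, so drop them when splitting.
$fruits = preg_split('/\R/', $cleaned, -1, PREG_SPLIT_NO_EMPTY);
print_r($fruits);
```

Note that the match ends right after the outermost CUT-TO, so any text following it on the same line would stay in the output.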
And here's a more compact version that does essentially the same thing:
^CUT-FROM-
(?:(?:(?!CUT-(?:FROM-|TO)) . )++ | (?R))*?
^CUT-TO
In this version, (?R) means recurse the whole pattern. It still uses the smx options. The one-liner version (without x) would be:
(?sm)^CUT-FROM-(?:(?:(?!CUT-(?:FROM-|TO)).)++|(?R))*?^CUT-TO
But I advise against doing such things. Prefer the version with (?(DEFINE) ... ) for readability.
Upvotes: 3