Reputation: 3
I've been bashing my head against a brick wall all day over trying to get an optional group to work in a preg_match_all() regular expression. The non-optional version parses the data perfectly, but as soon as I make one part of the regex optional, that optional part is never used to parse the data, even if the line it is targeting is present in the data.
This is the original regex that works:
$regex = "~:begin(.*)[\r\n]+:desc(.*)[\r\n]+(.*)[\r\n]+:end(?:.*)[\r\n]+~msU";
preg_match_all($regex, $text, $matches);
This is the text being parsed:
:begin test
:desc testing
some code
more code
last code
:end test
:begin test2
:desc testing2
some code2
last code2
:end test2
That regex parses the lines beginning with ":desc" correctly into its own group, but when I make the ":desc" line optional, the same group is always empty and the line is added to the following group instead, at the beginning of the "code" block.
This is the adjusted regex with the optional group for desc:
$regex = "~:begin(.*)[\r\n]+(:desc(.*)[\r\n]+)?(.*)[\r\n]+:end(?:.*)[\r\n]+~msU";
I believe I understand what's happening -- just not why or how to fix the problem. Clearly, because there isn't a definite marker of some kind at the beginning of the code block, when the preceding line is made optional, the regex is bypassing the optional group and lumping it all in with the code block that follows. I've tried playing with the flags, changing the groups to all kinds of combinations of greedy/non-greedy, but without inserting something like a ":code" prefix to indicate the start of the next block, I just can't stop the regex from placing the optional line into the code block after it.
I just want to be able to make the single-line :desc statement optional, without having to add more tags or delimiters to the data.
At this point, I'm stuck, and need some veteran regex expert to explain what's going on, and how to fix it (if possible).
Upvotes: 0
Views: 1384
Reputation: 2422
Negative lookahead can help here:
~:begin (.*)[\r\n]+(?::desc (.*)[\r\n]+)?^(?!:desc)(?:(.*)[\r\n]+)?:end(?:.*)[\r\n]+~msU
Main part that was added: ^(?!:desc) - this checks that the next line does not start with :desc
I also added (?:...) for the optional groups, so they are not captured for the result array. Remove these if necessary.
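To see the pattern at work, here is a minimal sketch that runs it against the sample data from the question. The variable names and the use of PREG_SET_ORDER are my own choices for readability, not something the question requires, and the pattern is written in single quotes so PHP passes \r and \n through to PCRE unchanged.

```php
<?php
// Sample data from the question. The final newline matters because the
// pattern requires [\r\n]+ after each :end line.
$text = ":begin test\n:desc testing\nsome code\nmore code\nlast code\n:end test\n"
      . ":begin test2\n:desc testing2\nsome code2\nlast code2\n:end test2\n";

// The pattern from this answer, unchanged.
$regex = '~:begin (.*)[\r\n]+(?::desc (.*)[\r\n]+)?^(?!:desc)(?:(.*)[\r\n]+)?:end(?:.*)[\r\n]+~msU';

// PREG_SET_ORDER groups the captures per block instead of per group.
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] = block name, $m[2] = description, $m[3] = code lines
    printf("name=%s | desc=%s | code=%s\n", $m[1], $m[2], json_encode($m[3]));
}
```

For the first block this should give the name "test", the description "testing" and the three code lines as one string, with the :desc line no longer leaking into the code group.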
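What exactly does the negative lookahead do? The main problem is (.*) in combination with your flags. By default the dot matches (almost!) any character, the exception being newline. But your regex uses the "s" flag (PCRE_DOTALL), so the dot matches newlines as well and (.*) can run across several lines. (The "m" flag only changes where ^ and $ match.) On top of that, the "U" flag makes every quantifier lazy, including the ? on the optional group, so the engine first tries to skip the :desc group. Skipping always succeeds, because the following (.*) can absorb the :desc line. The lookahead makes skipping fail whenever the next line starts with :desc, which forces the engine to go back and use the optional group.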
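The effect of the U flag on the optional group can be checked directly. The sketch below compares the question's second pattern with a variant that writes ?? after the optional group; under U a quantifier followed by ? becomes greedy. This variant is an alternative I am adding for illustration, not the fix proposed in this answer, and the variable names are made up.

```php
<?php
$text = ":begin test\n:desc testing\nsome code\nmore code\nlast code\n:end test\n";

// The question's second pattern: under U the "?" on the :desc group is lazy,
// so the engine tries to skip the group first.
$lazy = '~:begin(.*)[\r\n]+(:desc(.*)[\r\n]+)?(.*)[\r\n]+:end(?:.*)[\r\n]+~msU';

// Same pattern with "??" on the group, which under U means greedy:
// the engine tries to match the :desc line first.
$greedy = '~:begin(.*)[\r\n]+(:desc(.*)[\r\n]+)??(.*)[\r\n]+:end(?:.*)[\r\n]+~msU';

preg_match($lazy, $text, $a);
preg_match($greedy, $text, $b);

// Group 3 is the description text, group 4 is the code block.
var_dump($a[3], $a[4]); // description empty, code starts with ":desc testing"
var_dump($b[3], $b[4]); // description filled, code holds only the code lines
```

Note that the description keeps its leading space here because the question's pattern has no space after :desc.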
Let's break down your second regex into smaller parts:
:begin(.*)[\r\n]+
This part simply finds the first line. I only added a space here to exclude it from the result.
(:desc(.*)[\r\n]+)?
This is your original optional part, which should find the second line. Added space here as well.
(.*)[\r\n]+
This is the code part. In your case the optional :desc group before it was skipped (it is lazy under the U flag), so this group picked up the :desc line as well. The negative lookahead rules that out, and the code part was made optional too, so it became: ^(?!:desc)(?:(.*)[\r\n]+)? The "^" also makes sure the check happens at the beginning of a line.
:end(?:.*)[\r\n]+
No changes needed here.
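Putting the parts back together, the sketch below checks the case the question is really about: a block with no :desc line at all. The helper parseBlocks() and the test3/test4 data are hypothetical, made up for this example; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper (not part of the original answer): one array per block.
function parseBlocks(string $text): array
{
    $regex = '~:begin (.*)[\r\n]+(?::desc (.*)[\r\n]+)?^(?!:desc)(?:(.*)[\r\n]+)?:end(?:.*)[\r\n]+~msU';
    preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

    $blocks = [];
    foreach ($matches as $m) {
        $blocks[] = [
            'name' => $m[1],
            // An optional group that did not take part comes back as ''.
            'desc' => ($m[2] ?? '') === '' ? null : $m[2],
            // Trailing unmatched groups are dropped, hence the fallback.
            'code' => $m[3] ?? '',
        ];
    }
    return $blocks;
}

// Made-up input: the first block has no :desc line, the second one does.
$text = ":begin test3\nsome code3\nlast code3\n:end test3\n"
      . ":begin test4\n:desc testing4\nsome code4\n:end test4\n";

$blocks = parseBlocks($text);
print_r($blocks);
```

Mapping the empty string to null is just one way to tell "no description" apart; keeping the empty string works equally well if the calling code prefers it.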
Additional improvements
Not sure if needed or wanted, but in order to clean up the statement, I changed this a bit, and this one also captures the second text block.
~:begin ([^$]*)(?::desc([^$]*))?^(?!:desc)(?:([^$]*))?:end+~msU
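Note that inside a character class "$" is a literal dollar sign, not the end-of-line anchor, so [^$]* just matches any run of characters, newlines included, that contains no "$". The line boundaries are enforced by "^" instead, which is why the explicit newline checks can go. Two consequences: each captured group keeps its trailing newline, and a block whose text contains a "$" will not match.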
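A short sketch of how this compact pattern behaves in practice, assuming the same sample data as in the question. The trimming step and the $withDollar input are my own additions for illustration.

```php
<?php
// The compact pattern from above, unchanged.
$regex = '~:begin ([^$]*)(?::desc([^$]*))?^(?!:desc)(?:([^$]*))?:end+~msU';

$text = ":begin test\n:desc testing\nsome code\nmore code\nlast code\n:end test\n";
preg_match($regex, $text, $m);

// The raw captures still carry line breaks (and a leading space for desc).
echo json_encode([$m[1], $m[2], $m[3]]), "\n";

// Trim them before use.
$name = trim($m[1]);
$desc = trim($m[2]);
$code = rtrim($m[3], "\r\n");

// Made-up block whose code contains a dollar sign: [^$]* cannot cross it,
// so the block is not matched at all.
$withDollar = ":begin test\n\$x = 1;\n:end test\n";
$found = preg_match($regex, $withDollar);
var_dump($found);
```

If the code blocks can contain PHP variables or other "$" characters, the longer pattern with (.*) and [\r\n]+ is the safer choice.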
Upvotes: 1