Reputation: 4022
I have a regex that seems straightforward but does not behave as I expect. The input to be parsed is described as follows (nb: {} are not part of the regex, only what's inside):
The whole construct SPACE-SPACEf1 is optional
Just to explain what is captured into f1: the first char must come from the set [A-Za-z], followed by \w or whitespace (\s) zero or more times. All of this is captured into f1.
(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?$
I expect the following sequences to match and capture a value into f1:
I expect the following to not match:
I expect the following match but not capture a value into f1 (I would prefer it to not match though):
These are the actual results:
" - Bruce Dickinson" (f1=Bruce Dickinson) Tick; this works
"Bruce" (f1 not captured, but the expression is a match. This is wrong, because Bruce doesn't match the optional part, and $ comes next, which doesn't match Bruce)
" - Bruce!" (f1 not captured, but the expression is a match. This is wrong, because of the !, which means that the match does not appear at the end of the line)
I expect that:
(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?
would consume { - Bruce}, which should leave !, which should fail because of the next regex token being $; however, the computer says no, so I'm wrong but I don't know why :(
If I can get this to work, I can get the rest of my expression to work the way I want it to. I need somebody else to jolt me into thinking about this differently. I've spent 2 days on this with no positive output, so very frustrating.
PS: I am using regex101.com to test regexes. The regexes will be used as part of a Rust application whose regex engine is based on Google's RE2.
Eventually, I need to be able to recognise a sequence of names delimited by &, and the whole expression is optional by the use of ? and must appear at the end of line $.
So
and
But 1 step at a time!
Upvotes: 1
Views: 1261
Reputation: 627219
The point here is that you cannot match and not match something at the same time. If you make the whole pattern optional, and the end of string obligatory, even if there is nothing of interest the end of string will be matched - always.
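To see this concretely, here is a minimal sketch using Python's `re` module, which accepts the same `(?P<name>...)` syntax. It is only an illustration: the helper name `describe` is made up for this example, and I am assuming the Rust engine behaves the same way for this particular pattern. The point is where the match lands, not just whether there is one.

```python
import re

# The pattern from the question: an optional group followed by end of string.
OPTIONAL_AT_END = re.compile(r"(?:\s*-\s*(?P<f1>[A-Za-z][\w\s]*))?$")


def describe(text):
    """Return (span, f1) for the first match found by search(), or None."""
    m = OPTIONAL_AT_END.search(text)
    if m is None:
        return None
    return (m.span(), m.group("f1"))


# The optional group takes part in the match here.
print(describe(" - Bruce Dickinson"))  # ((0, 18), 'Bruce Dickinson')

# Here the group is skipped and only `$` matches: an empty match
# sitting at the very end of the input.
print(describe("Bruce"))      # ((5, 5), None)
print(describe(" - Bruce!"))  # ((9, 9), None)
```

In the last two cases the engine does not "match Bruce" at all. It finds a zero-length match at the end of the string, which is always available when everything before `$` is optional.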
The way out is to think of a subpattern you are interested in. You are interested in the names, so, make the first letter obligatory. The hyphen seems to be obligatory in all test cases you supplied, too. Everything else can be optional:
\s*-\s*(?P<f1>([^\W\d_])\w*(?:\s+\w+)*)(?:\s*&\s*(?P<f2>([^\W\d_])\w*(?:\s+\w+)*))*$
See the regex demo (the \s is replaced with \h and \n added to the negated character classes just for demo purposes as it is a multiline demo).
Note that I replaced [a-zA-Z] with [^\W\d_] to make the pattern more flexible ([^\W\d_] just matches any letter).
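As a quick check of the pattern above against the cases from the question, here is a small sketch in Python's `re` (again an assumption on my side that the target engine agrees for this pattern; `parse_names` is a hypothetical helper, and the "Steve Harris" and "A & B & C" inputs are made-up test strings).

```python
import re

# The suggested pattern: hyphen and first letter are required,
# further "& name" parts are optional, and everything ends at `$`.
NAMES_AT_END = re.compile(
    r"\s*-\s*(?P<f1>([^\W\d_])\w*(?:\s+\w+)*)"
    r"(?:\s*&\s*(?P<f2>([^\W\d_])\w*(?:\s+\w+)*))*$"
)


def parse_names(text):
    """Return (f1, f2) if the pattern matches, otherwise None."""
    m = NAMES_AT_END.search(text)
    if m is None:
        return None
    return (m.group("f1"), m.group("f2"))


print(parse_names(" - Bruce Dickinson"))                 # ('Bruce Dickinson', None)
print(parse_names("Bruce"))                              # None
print(parse_names(" - Bruce!"))                          # None
print(parse_names(" - Bruce Dickinson & Steve Harris"))  # ('Bruce Dickinson', 'Steve Harris')

# A repeated group only remembers its last iteration,
# so with three names f2 holds the final one.
print(parse_names(" - A & B & C"))                       # ('A', 'C')
```

The last call shows a limitation worth keeping in mind: because f2 sits inside a repeated group, only the final name is kept. If every name is needed, one option is to capture the whole "& ..." tail and split it on & afterwards.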
Upvotes: 4