Reputation: 17375
Given the following simple regular expression, whose goal is to capture the text between quote characters:
regexp = '"?(.+)"?'
When the input is something like:
"text"
The capturing group(1) has the following:
text"
I expected group(1) to contain only text (without the quotes). Could somebody explain what's going on, and why the regular expression captures the " symbol even though it's outside capturing group #1? Another strange behavior that I don't understand is why the second quote character is captured but not the first one, given that both of them are optional. In the end I fixed it by using the following regex, but I would like to understand what I was doing wrong:
regexp = '"?([^"]+)"?'
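For reference, the behavior can be reproduced with a short script. This is a minimal sketch that assumes Python's re module (the question does not name the engine); the variable names are illustrative only.

```python
import re

# Pattern from the question and the negated-class fix, applied to the sample input.
original = re.compile(r'"?(.+)"?')
fixed = re.compile(r'"?([^"]+)"?')

sample = '"text"'

original_group = original.search(sample).group(1)
fixed_group = fixed.search(sample).group(1)

print(original_group)  # text"
print(fixed_group)     # text
```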
Upvotes: 5
Views: 437
Reputation: 627536
Solution
regexp = '^"?(.*?)"?$'
Or, if the regex engine allows lookarounds, including a variable-width lookbehind (for example .NET or modern JavaScript):
regexp = '(?<=^"?).*?(?="?$)'
Details
^ - start of string
"? - an optional " char
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
"? - an optional " char
$ - end of string.
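The anchored, lazy pattern can be tried against a few inputs. The sketch below assumes Python's re module; the helper name unquote and the sample inputs are hypothetical, chosen only to exercise the optional quotes.

```python
import re

# Anchored pattern with a lazy group: both quotes stay outside group 1.
solution = re.compile(r'^"?(.*?)"?$')

def unquote(s):
    """Return group 1 of the anchored pattern, or None if there is no match."""
    m = solution.match(s)
    return m.group(1) if m else None

samples = ['"text"', 'text', '"text', 'text"', '""']
results = {s: unquote(s) for s in samples}

for s, captured in results.items():
    print(repr(s), '->', repr(captured))
```

With these inputs every quoted or half-quoted variant of text yields text, and the empty pair of quotes yields an empty string.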
Explanation
why the regular expression is capturing the " symbol even when it's outside the capturing group #1
The "?(.+)"? pattern contains a greedy dot matching subpattern. A . can match a ", too. The trailing "? is an optional subpattern. This means that if the preceding subpattern is greedy (and .+ is a greedy subpattern) and can match what the optional subpattern matches (and . can match a "), the .+ will consume that optional character, leaving the "? to match an empty string.
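One way to see which part of the pattern consumed each quote is to wrap the two optional quotes in capturing groups of their own. This is an illustrative variant of the pattern, not the one from the question, and it assumes Python's re module.

```python
import re

# Same pattern, but each optional quote gets its own group for inspection.
m = re.search(r'("?)(.+)("?)', '"text"')
leading, body, trailing = m.groups()

print(repr(leading))   # '"'     -> the first "? took the opening quote
print(repr(body))      # 'text"' -> the greedy .+ ran to the end of the input
print(repr(trailing))  # ''      -> the last "? matched an empty string
```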
The negated character class is a correct way to match any characters but a certain one/range(s) of characters. [^"] will never match a ", so the last " is never consumed by the group and is left for the trailing "? to match.
why the second quote character is captured but not the first one given that both of them are optional
The first "? comes before the greedy dot matching pattern. The engine sees the " (if it is in the string) and matches the quote with the first "?.
Upvotes: 2
Reputation: 17781
Quantifiers in regular expressions are greedy by default: they try to match as much text as possible. Because your last " is optional (you wrote "? in your regular expression), the .+ will match it.
Using [^"] is one acceptable solution. The drawback is that your string cannot contain " characters (which may or may not be desirable, depending on the case).
Another is to make " required:
regexp = '"(.+)"'
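A quick check of the required-quotes variant, assuming Python's re module. The inputs are hypothetical; the third one shows that the greedy .+ still spans from the first quote to the last one when several quoted parts appear.

```python
import re

# Both quotes are mandatory, so .+ must give the last quote back.
required = re.compile(r'"(.+)"')

quoted = required.search('"text"')
unquoted = required.search('text')
two_parts = required.search('"a" and "b"')

print(quoted.group(1))     # text
print(unquoted)            # None, since there are no quotes to match
print(two_parts.group(1))  # a" and "b
```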
Another one is to make the + non-greedy, by using +?. However, you also need to add anchors ^ and $ (or similar, depending on the context), otherwise it'll match only the first character (t in the case of "test"):
regexp = '^"?(.+?)"?$'
This regular expression allows " characters to be in the middle of the string, so that "t"e"s"t" will result in t"e"s"t being captured by the group.
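The effect of the anchors on the lazy quantifier can be compared side by side. A minimal sketch, assuming Python's re module; the variable names are illustrative.

```python
import re

lazy_unanchored = re.compile(r'"?(.+?)"?')
lazy_anchored = re.compile(r'^"?(.+?)"?$')

# Without anchors the lazy group stops as soon as the overall match can succeed.
first_char = lazy_unanchored.search('"test"').group(1)

# With anchors the group must stretch until "?$ can match at the end.
whole = lazy_anchored.search('"test"').group(1)
embedded = lazy_anchored.search('"t"e"s"t"').group(1)

print(first_char)  # t
print(whole)       # test
print(embedded)    # t"e"s"t
```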
Upvotes: 2
Reputation: 6278
.+ matches as many characters as it can (including the "). When it reaches the end of the input, the "? still matches, because it makes the " optional and can therefore match nothing.
You should use a non-greedy quantifier and make the quotes required:
"(.+?)"
Upvotes: 0
Reputation: 384
Quantifiers in a regexp are greedy by default: they try to match as much as possible. Since your capturing group contains .+, it will consume the closing quote before the engine ever reaches the "?. Then, when exiting the group, the engine is at the end of your line, where the optional "? matches an empty string.
Upvotes: 0
Reputation: 4531
.+ is greedy. It'll collect everything including the ". Your final "? doesn't require that a quote be present, hence .+ includes the quote.
The first quote isn't captured because it's matched by the leading "?, which sits outside the group.
Upvotes: 0