Reputation: 5097
>>> re.match(r'"([^"]|(\\\"))*"', r'"I do not know what \"A\" is"').group(0)
'"I do not know what \\"'
>>> re.match(r'"((\\\")|[^"])*"', r'"I do not know what \"A\" is"').group(0)
'"I do not know what \\"A\\" is"'
These two regexes are meant to match quoted strings that may contain escaped quotes. The only difference, unless I am missing something, is the order of the alternatives inside the parentheses.
Why do they not both match the entire string?
Upvotes: 1
Views: 95
Reputation:
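What you say is true: the order is different. And that changes what gets matched.
In the first one, "([^"]|(\\\"))*", the [^"] alternative also matches the escape character (the backslash), so on "asdf\" sde" it matches only "asdf\" and stops there, while the other one doesn't.
Also, if you have to handle escaped quotes, you have to handle escapes in general (for example an escaped backslash, \\, right before the closing quote). So neither one is valid.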
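To make the "neither one is valid" point concrete, here is a small check, assuming Python's re module as in the question. The input string is a made-up example (not from the question) whose first quoted string ends in an escaped backslash.

```python
import re

# The two patterns from the question.
first = r'"([^"]|(\\\"))*"'
second = r'"((\\\")|[^"])*"'

# Hypothetical input: the first quoted string ends with an escaped backslash,
# so it should close at the quote that follows the two backslashes.
text = r'"C:\\" and "D"'

first_match = re.match(first, text).group(0)
second_match = re.match(second, text).group(0)

print(first_match)   # "C:\\"
print(second_match)  # "C:\\" and "
```

The second pattern reads the second backslash plus the closing quote as an escaped quote and runs on to the next quote. The first pattern stops in the right place here only by accident, and it still fails on the input from the question.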
Here are two fairly standard ways to do this.
Both handle the escape.
You can extend this to single quotes as well.
Use the Dot-All modifier (?s) if you want to span newlines.
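A quick check of where (?s) makes a difference, assuming Python's re module and the Method 1 pattern from this answer. The two sample strings are invented for illustration. The negated class already matches a bare newline, so in this sketch the flag only matters for a backslash that is directly followed by a newline.

```python
import re

plain = r'"(?:\\.|[^"\\]+)*"'
dotall = r'(?s)"(?:\\.|[^"\\]+)*"'

multiline = '"line one\nline two"'      # bare newline inside the quotes
continued = '"line one\\\nline two"'    # backslash followed by a newline

plain_multiline = re.match(plain, multiline) is not None
plain_continued = re.match(plain, continued) is not None
dotall_continued = re.match(dotall, continued) is not None

print(plain_multiline)   # True: [^"\\] matches the newline
print(plain_continued)   # False: without (?s), "." rejects the newline
print(dotall_continued)  # True
```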
Method 1. - alternation
"(?:\\.|[^"\\]+)*"
 "
 (?:
      \\ .          # Escape anything
   |                # or,
      [^"\\]+       # Not escape not quote
 )*
 "
Method 2. - unrolled loop
"[^"\\]*(?:\\.[^"\\]*)*"
 "
 [^"\\]*            # Optional not escape not quote
 (?:
      \\ .          # Escape anything
      [^"\\]*       # Optional not escape not quote
 )*
 "
Both do the same thing. Method 2 is three to five times faster than Method 1.
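A minimal sketch for trying both methods in Python, assuming the re and timeit modules. The first sample is the string from the question; the second is a made-up case with an escaped backslash. The timing loop is only a rough way to compare the two on your own machine; the numbers depend on the regex engine, its version, and the input, so no result is claimed here.

```python
import re
import timeit

method1 = re.compile(r'"(?:\\.|[^"\\]+)*"')          # alternation
method2 = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')    # unrolled loop

samples = [
    r'"I do not know what \"A\" is"',   # escaped quotes (from the question)
    r'"C:\\" and "D"',                  # hypothetical: escaped backslash
]

results = []
for s in samples:
    m1 = method1.match(s).group(0)
    m2 = method2.match(s).group(0)
    results.append((m1, m2))
    print(m1, m1 == m2)

# Rough timing on the first sample; treat the output as indicative only.
t1 = timeit.timeit(lambda: method1.match(samples[0]), number=20000)
t2 = timeit.timeit(lambda: method2.match(samples[0]), number=20000)
print(f"method 1: {t1:.4f}s, method 2: {t2:.4f}s")
```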
Upvotes: 2
Reputation: 3752
The regex r'(A|B)' will first try to match A, and only if that fails will it try to match B (docs).
So the regex ([^"]|(\\\")) will first try to match a non-quote, and if that fails it will try to match an escaped quote mark.
So when the regex reaches \"A\", the first part matches the \ (it is not a quote). But then neither part matches ", so the repetition stops there and the pattern's closing " matches that quote, ending the match. The backslash is gobbled by the [^"], so the second half of the expression is never used.
Turned around, ((\\\")|[^"]), when it reaches \"A\", will first try to match the \" (it works), then it will try to match A (it matches [^"]), and so the match continues.
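The left-to-right rule is easy to see on a smaller pattern. This is a minimal illustration, assuming Python's re module; the patterns a|ab and ab|a are stand-ins, not taken from the question.

```python
import re

# The leftmost alternative that lets the match succeed wins,
# even when a later alternative would match more text.
short_first = re.match(r'a|ab', 'ab').group(0)
long_first = re.match(r'ab|a', 'ab').group(0)

print(short_first)  # a
print(long_first)   # ab
```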
Upvotes: 0
Reputation: 9650
The order in alternation groups matters.
In the first regex the [^"] alternative is tried first for every character. It matches every single character up to (and including) the first \. On the next character (") this alternative ([^"]) fails and the other one (\\\") is tried. The latter also fails, since "A does not match \\\". This stops the quantifier * from matching any further.
In the second regex the \\\" alternative (the parentheses are redundant) is tried first at every position. It fails on ordinary characters, so the second alternative ([^"]) matches them. But at the first \ the first alternative matches, so the match position moves past \" to A and matching goes on.
As a general rule of thumb, place the narrowest (most specific) expression first in an alternation.
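The walk-through above can be replayed with a small tracer. This is only a sketch: trace is a hypothetical helper, the named groups escaped and other are labels added for illustration, and the loop applies the alternation once per step without backtracking, which is a simplification of what the * quantifier really does.

```python
import re

def trace(alternation, text):
    """Apply one alternation repeatedly from the start of text and record
    which branch consumed each piece (hypothetical helper, no backtracking)."""
    step = re.compile(alternation)
    pos, pieces = 0, []
    while pos < len(text):
        m = step.match(text, pos)
        if m is None:
            break  # neither alternative matches here, so the repetition stops
        pieces.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return pieces

body = r'is \"A\" ok'  # the inside of a quoted string with escaped quotes

broad_first = trace(r'(?P<other>[^"])|(?P<escaped>\\")', body)
narrow_first = trace(r'(?P<escaped>\\")|(?P<other>[^"])', body)

print(''.join(piece for _, piece in broad_first))   # is \
print(''.join(piece for _, piece in narrow_first))  # is \"A\" ok
print([piece for name, piece in narrow_first if name == 'escaped'])
```

With the broad alternative first, the backslash is consumed on its own and the loop stalls at the quote. With the narrow alternative first, each \" is consumed as a unit and the whole body is covered.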
Upvotes: 1