Reputation: 477
I'm trying to match Python-style single- and multi-line strings. Here's what I've come up with so far:
public const string PythonString = @"(?<string>('''[^(''')]*''')|(""""""[^("""""")]*"""""")|("".*"")|('.*'))";
It fails when there is, for example, a single " inside a triple-"-delimited string:
"""
msg = "Nothing in this file is used in w3af. This was a test that was truncated by my personal\
lack of interest in using encryption here, my lack of time and the main reason: I'm lazy ;)\
Also, pyrijndael was only used here, so I removed the dependency, which was a problem for debian."
raise Exception(msg)
"""
Here, the " in the string forces the regex to stop the match after the first triple-", instead of matching the whole block.
How do I fix this?
Upvotes: 2
Views: 94
Reputation: 626871
It is a common misconception that placing a sequence of chars inside a negated character class matches any text other than that sequence. A character class is a set of single chars, so [^(''')]* is equivalent to [^)(']*: any 0+ chars other than (, ) and '.
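A minimal sketch to check that equivalence (the class name NegatedClassDemo and the helper Prefix are hypothetical, introduced only for illustration): both patterns stop at the first (, ) or ', whichever comes first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Hypothetical helper: returns what the pattern matches at the start of the input.
    public static string Prefix(string pattern, string input) =>
        Regex.Match(input, "^" + pattern).Value;

    public static void Main()
    {
        const string input = "abc(def'ghi";

        // The parentheses and repeated quotes are just members of the set,
        // so both classes exclude exactly the chars (, ) and '.
        Console.WriteLine(Prefix(@"[^(''')]*", input)); // abc
        Console.WriteLine(Prefix(@"[^)(']*", input));   // abc
    }
}
```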
You need to use lookaheads here together with negated character classes:
@"(?s)(?<string>('''[^']*(?:'(?!'')[^']*)*''')|(""""""[^""]*(?:""(?!"""")[^""]*)*"""""")|(""[^""\\]*(?:\\.[^""\\]*)*"")|('[^'\\]*(?:\\.[^'\\]*)*'))"
The subpattern [^']*(?:'(?!'')[^']*)* matches:
[^']* - any 0+ chars other than '
(?:'(?!'')[^']*)* - 0+ sequences of:
    '(?!'') - a ' not followed by two ' chars
    [^']* - any 0+ chars other than '
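As a rough check of the triple-quote subpattern, here is a sketch that runs its double-quote variant against a shortened block like the one in the question. The class TripleQuoteDemo and its members are hypothetical names; the double-quote pattern is derived from the single-quote one by character replacement only to avoid counting doubled quotes in a verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TripleQuoteDemo
{
    // Triple-single-quoted branch, as described above.
    public const string TripleSingle = @"'''[^']*(?:'(?!'')[^']*)*'''";

    // Same pattern for triple-double-quoted strings: swap every ' for ".
    public static readonly string TripleDouble = TripleSingle.Replace('\'', '"');

    // Hypothetical helper: returns the first triple-double-quoted string in the source.
    public static string FirstTripleDouble(string source) =>
        Regex.Match(source, TripleDouble).Value;

    public static void Main()
    {
        // A triple-quoted block that contains lone " characters.
        string source = "x = \"\"\"\nmsg = \"I'm lazy ;)\"\nraise Exception(msg)\n\"\"\" + y";

        // The lone quotes are consumed by '(?!'')-style steps, so the match
        // runs up to and including the closing triple quote.
        Console.WriteLine(FirstTripleDouble(source));
    }
}
```

Note that this subpattern does not treat backslash escapes specially inside triple-quoted strings.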
When matching single-quoted literals (and, analogously, double-quoted ones), you need to account for escaped chars, so you need [^'\\]*(?:\\.[^'\\]*)* between the quotes inside the pattern:
[^'\\]* - any 0+ chars other than ' and \
(?:\\.[^'\\]*)* - zero or more sequences of:
    \\. - a \ followed by any char
    [^'\\]* - any 0+ chars other than ' and \
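To see why the escape-aware subpattern matters, the sketch below compares it with the greedy '.*' from the question on a line containing an escaped quote. The names EscapedQuoteDemo, Naive, EscapeAware and First are hypothetical, chosen for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedQuoteDemo
{
    // Greedy pattern from the question: runs to the last ' on the line.
    public const string Naive = @"'.*'";

    // Escape-aware pattern: (?s) lets \\. also consume a backslash-newline pair.
    public const string EscapeAware = @"(?s)'[^'\\]*(?:\\.[^'\\]*)*'";

    // Hypothetical helper: returns the first match of the pattern in the source.
    public static string First(string pattern, string source) =>
        Regex.Match(source, pattern).Value;

    public static void Main()
    {
        // Python source text: print('it\'s ok', 'b')
        string source = "print('it\\'s ok', 'b')";

        Console.WriteLine(First(Naive, source));       // 'it\'s ok', 'b'
        Console.WriteLine(First(EscapeAware, source)); // 'it\'s ok'
    }
}
```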
Upvotes: 2