trollpidor

Reputation: 477

Regular expression does not work as intended

I'm trying to match Python-style single- and multi-line strings. Here's what I've come up with so far:

public const string PythonString = @"(?<string>('''[^(''')]*''')|(""""""[^("""""")]*"""""")|("".*"")|('.*'))";

It fails when there is, for example, a single " inside a triple-" string:

"""
msg = "Nothing in this file is used in w3af. This was a test that was truncated by my personal\
lack of interest in using encryption here, my lack of time and the main reason: I'm lazy ;)\
Also, pyrijndael was only used here, so I removed the dependency, which was a problem for debian."
raise Exception(msg)
"""

Here, the " inside the string makes the regex stop right after the opening triple-", instead of matching the whole block. How do I fix this?

Upvotes: 2

Views: 94

Answers (1)

Wiktor Stribiżew

Reputation: 626871

It is a common misconception that placing a sequence of chars into a negated character class will match any text other than that sequence. A character class always matches a single char, so [^(''')]* is equivalent to [^)(']*: zero or more chars other than ), ( and '.
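To make the equivalence concrete, here is a minimal sketch that runs the question's triple-' sub-pattern and its simplified form against three made-up inputs. The sample strings are illustrative assumptions, not taken from the question, and the snippet assumes C# 9+ top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// The question's sub-pattern and the character class it really means.
string original   = @"'''[^(''')]*'''";
string simplified = @"'''[^)(']*'''";

// Hypothetical inputs, chosen only to exercise the character class.
string[] samples =
{
    "'''plain text'''",   // no ', ( or ) inside: both patterns match
    "'''it's here'''",    // a lone ' inside: neither pattern matches
    "'''call f(x)'''",    // parentheses inside: neither pattern matches
};

foreach (string s in samples)
{
    bool a = Regex.IsMatch(s, original);
    bool b = Regex.IsMatch(s, simplified);
    Console.WriteLine($"{s,-20} original={a} simplified={b}");
}
```

Both patterns agree on every input, and both reject a body that contains a parenthesis, which is clearly not what the original pattern was meant to do.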

You need to use lookaheads here together with negated character classes:

@"(?s)(?<string>('''[^']*(?:'(?!'')[^']*)*''')|(""""""[^""]*(?:""(?!"""")[^""]*)*"""""")|(""[^""\\]*(?:\\.[^""\\]*)*"")|('[^'\\]*(?:\\.[^'\\]*)*'))"

The [^']*(?:'(?!'')[^']*)* part matches:

  • [^']* - any 0+ chars other than ' (including newlines)
  • (?:'(?!'')[^']*)* - 0+ sequences of:
    • '(?!'') - a ' not followed by two more ' chars
    • [^']* - any 0+ chars other than '.
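As a rough illustration of the breakdown above, the sketch below applies just the triple-' branch to a made-up input containing lone ' characters and a line break. The input string and variable names are assumptions for demonstration only (C# 9+ top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Triple-single-quote branch of the answer's pattern, on its own.
var tripleSingle = new Regex(@"'''[^']*(?:'(?!'')[^']*)*'''");

// Hypothetical input: lone ' chars and a newline inside the literal,
// followed by an ordinary single-quoted string.
string input = "x = '''it's a 'quoted' word\nacross two lines''' + 'tail'";

Match m = tripleSingle.Match(input);

// The match runs from the opening ''' to the first closing ''',
// stepping over each lone ' thanks to the (?!'') lookahead.
Console.WriteLine(m.Success);
Console.WriteLine(m.Value);
```

The negated class [^'] already matches newlines, so this branch does not depend on the (?s) flag.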

When matching single-quoted literals (and likewise the double-quoted ones), you need to account for escaped chars, so you need [^'\\]*(?:\\.[^'\\]*)* in between the quotes inside the pattern:

  • [^'\\]* - any 0+ chars other than ' and \
  • (?:\\.[^'\\]*)* - 0+ sequences of:
    • \\. - a \ followed by any char (including a newline, because of (?s))
    • [^'\\]* - any 0+ chars other than ' and \
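Putting the pieces together, here is a small sketch that runs the full pattern against two inputs: a made-up line with an escaped quote, and a shortened version of the block from the question. The pattern is copied unchanged from above; the sample strings and variable names are assumptions (C# 9+ top-level statements).

```csharp
using System;
using System.Text.RegularExpressions;

// Full pattern, copied unchanged from the answer above.
string pattern =
    @"(?s)(?<string>('''[^']*(?:'(?!'')[^']*)*''')|(""""""[^""]*(?:""(?!"""")[^""]*)*"""""")|(""[^""\\]*(?:\\.[^""\\]*)*"")|('[^'\\]*(?:\\.[^'\\]*)*'))";
var regex = new Regex(pattern);

// Hypothetical input: an escaped ' inside a single-quoted literal.
string escaped = @"name = 'O\'Brien' + 'x'";
foreach (Match m in regex.Matches(escaped))
{
    // Prints 'O\'Brien' and then 'x'
    Console.WriteLine(m.Groups["string"].Value);
}

// Shortened stand-in for the question's block: a triple-" string
// whose body contains ordinary " characters.
string block =
    "\"\"\"\n" +
    "msg = \"Nothing in this file is used in w3af.\"\n" +
    "raise Exception(msg)\n" +
    "\"\"\"";

Match whole = regex.Match(block);

// True when the match spans the entire block.
Console.WriteLine(whole.Value == block);
```

The order of the alternatives matters: the triple-quoted branches come first, so an opening """ is never consumed by the plain "..." branch as an empty string.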

Upvotes: 2
