john1994

Reputation: 19

Regex to find multiline comments in Python that contain a certain word

How can I define a regex to find multiline comments in Python that contain the word "xyz"? Here is an example of a string that should match:

"""
blah blah
blah
xyz
blah blah
"""

I tried this regex:

"""((.|\n)(?!"""))*?xyz(.|\n)*?"""

(grep -i -Pz '"""((.|\n)(?!"""))*?xyz(.|\n)*?"""')

but it was not good enough. For example, for this input:

 """
    blah blah blah
    blah
"""

   # xyz
               
 def foo(self):
"""
blah
"""

it matched this string:

"""

   # xyz
               
 def foo(self):
"""

The expected behavior in this case is to not match anything, since "xyz" is not inside a comment block.

I wanted it to only find "xyz" within opening quotes and closing quotes, but the string it matches is not inside a quotes block. It matches a string that starts with a quote, has "xyz" in it and ends with a quote, but the matched string is NOT inside a python comment block.

Any idea how to get the required behavior from this regex?

Upvotes: 1

Views: 193

Answers (1)

bobble bubble

Reputation: 18490

The main challenge is keeping track of the """ ... """ balance, i.e. whether you are inside or outside a comment.
Here is an idea with PCRE (e.g. PyPI regex with Python) or grep -Pz (like in your example).

(?ims)^"""(?:(?:[^"]|"(?!""))*?(xyz))?.*?^"""(?(1)|(*SKIP)(*F))

See this demo at regex101 (used with i ignorecase, m multiline and s dotall flags)

This works because the search string is matched optionally, which prevents backtracking into another match and losing the overall balance. The simplest pattern for keeping the balance would be """.*?""". But as soon as you require some substring inside, the regex engine will try to succeed by any means, including starting the match at a closing """.

To get around this, the search string can be matched optionally, which keeps the balance by preventing backtracking. Simplified example: """([^"]*?xyz)?.*?""" versus the unwanted """([^"]*?xyz).*?""".
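
To make the difference between the two simplified patterns concrete, here is a minimal sketch using Python's standard re module with the dotall flag. The sample text and the names SAMPLE, MANDATORY, OPTIONAL, mandatory_match and optional_groups are my own illustration and not part of the patterns above.

```python
import re

# Hypothetical sample: two comment blocks without "xyz", with "xyz" between them.
SAMPLE = '"""\nblah\n"""\n\n# xyz\n\n"""\nblah\n"""\n'

# Search string is mandatory: the engine may start at a closing delimiter.
MANDATORY = re.compile(r'"""([^"]*?xyz).*?"""', re.S)
# Search string is optional: each block is consumed as a whole.
OPTIONAL = re.compile(r'"""([^"]*?xyz)?.*?"""', re.S)


def mandatory_match(text):
    """Return the first match of the mandatory variant, or None."""
    m = MANDATORY.search(text)
    return m.group(0) if m else None


def optional_groups(text):
    """Return group 1 of every match of the optional variant."""
    return [m.group(1) for m in OPTIONAL.finditer(text)]
```

On this sample the mandatory variant starts its match at the closing delimiter of the first block and ends at the opening delimiter of the second, so it captures the "# xyz" line that sits outside any comment. The optional variant matches the two real blocks, each with group 1 unset.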

Now, to still let the matches without the search string fail, I used a conditional afterwards together with the PCRE verbs (*SKIP)(*F). If the first group fails (no search string inside), the match just gets skipped.


For usage with grep here is a demo at tio.run, or alternatively: pcregrep -M '(?is)pattern'
As mentioned above, in Python this pattern requires PyPI regex; see a Python demo at tio.run.
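
If installing PyPI regex is not an option, the same idea can probably be approximated with the standard re module, which does not support (*SKIP)(*F): keep the optional group, drop the conditional, and discard in Python the matches where group 1 did not participate. This is a minimal sketch, not a drop-in equivalent. It assumes the """ delimiters start at the beginning of a line, as in the pattern above, and the function name find_comment_blocks is hypothetical.

```python
import re


def find_comment_blocks(source, word):
    """Return the triple-quoted blocks of `source` that contain `word`.

    Assumes each delimiter sits at the start of a line.
    """
    pattern = re.compile(
        # Optional group 1 captures the word only if it occurs before the
        # closing delimiter; otherwise the block is still consumed whole.
        r'^"""(?:(?:[^"]|"(?!""))*?(' + re.escape(word) + r'))?.*?^"""',
        re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    # Filtering on group 1 replaces the (?(1)|(*SKIP)(*F)) part.
    return [m.group(0) for m in pattern.finditer(source) if m.group(1) is not None]
```

Because finditer consumes every block, including the ones without the word, the next match attempt always starts after a closing delimiter, which is what keeps the balance here.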

Upvotes: 1
