baallezx

Reputation: 481

Python regular expression for a comment in a long string

I am trying to work out a good regular expression for Python comments located within a long string. So far I have:

regex:

#(.?|\n)*

string:

'### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

I feel like there is a much better way to get all of the individual comments from the string, but I am not an expert in regular expressions. Does anyone have a better solution?

Upvotes: 3

Views: 154

Answers (3)

Braj

Reputation: 46841

Get the comments from the matched group at index 1.

(#+[^\\\n]*)


Sample code:

import re
p = re.compile(ur'(#+[^\\\n]*)')
test_str = u"..."

re.findall(p, test_str)

Matches:

1.  ### this is a comment
2.  # this call outputs an xml stream of the current parameter dictionary.
3.  # wow another comment
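The sample code above uses the ur'' prefix, which only exists in Python 2. As a rough sketch of the same pattern on Python 3 (the sample string here is a made-up stand-in, not the string from the question):

```python
import re

# Same pattern as above: one or more '#', then everything up to
# the next newline or backslash.
pattern = re.compile(r'(#+[^\\\n]*)')

# Hypothetical input used only for illustration.
sample = "### header comment\nx = 1  # trailing comment\nprint(x)\n"

comments = pattern.findall(sample)
print(comments)  # prints ['### header comment', '# trailing comment']
```

Keep in mind that this pattern does not look at quotes, so a # inside a string literal would be matched as well, and a match stops at the first backslash because of the \\ in the character class.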

Upvotes: 1

user2555451

Reputation:

Regex will work fine if you do two things:

  1. Remove all string literals (since they can contain # characters).

  2. Capture everything that starts with a # character and proceeds to the end of the line.

Below is a demonstration:

>>> from re import findall, sub
>>> string = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['### this is a comment', '# this call outputs an xml stream of the current parameter dictionary.', '# wow another comment']
>>>

re.sub removes anything of the form "..." or '...'. This saves you from having to worry about comments that are inside string literals.

(?s) sets the dot-all flag, which allows . to match newline characters.

Lastly, re.findall gets everything that starts with a # character and proceeds to the end of the line.
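The two steps can also be wrapped in a small helper. This is only a sketch for Python 3; the names STRING_LITERAL, find_comments and sample are made up for illustration:

```python
import re

# Non-greedy match for '...' or "..."; (?s) lets . span newlines,
# which also covers triple-quoted strings.
STRING_LITERAL = re.compile(r'''(?s)'.*?'|".*?"''')


def find_comments(source):
    """Return every '#...' run that remains once string literals are removed."""
    without_strings = STRING_LITERAL.sub('', source)
    return re.findall(r'#.*', without_strings)


# Hypothetical input used only for illustration.
sample = 'a = "not # a comment"  # real comment\nb = \'#nope\'\n# last one\n'
print(find_comments(sample))  # prints ['# real comment', '# last one']
```

One caveat: a quote character inside a comment (for example the apostrophe in "# don't") is treated as the start of a string literal, so input like that can lose or mangle comments.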


For a more complete test, place this sample code in a file named test.py:

# Comment 1
for i in range(10): # Comment 2
    print('#foo')
    print("abc#bar")
    print("""
#hello
abcde#foo
""")  # Comment 3
    print('''#foo
    #foo''')  # Comment 4

The solution given above still works:

>>> from re import findall, sub
>>> string = open('test.py').read()
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['# Comment 1', '# Comment 2', '# Comment 3', '# Comment 4']
>>>

Upvotes: 1

alecxe

Reputation: 473853

Since the string contains Python code, I'd use the tokenize module to parse it and extract the comments:

import tokenize
import StringIO

text = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something():\n    # this call outputs an xml stream of the current parameter dictionary.\n    paramtertools.print_header(params)\n\nfor i in xrange(256):    # wow another comment\n    print i**2\n\n'

tokens = tokenize.generate_tokens(StringIO.StringIO(text).readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokens:
    if toktype == tokenize.COMMENT:
        print ttext

Prints:

### this is a comment
# this call outputs an xml stream of the current parameter dictionary.
# wow another comment
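The code above is Python 2 (StringIO module, print statement). On Python 3 the same idea might look roughly like this; extract_comments and sample are names made up for this sketch:

```python
import io
import tokenize


def extract_comments(source):
    """Return the text of every COMMENT token in a string of Python source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.COMMENT]


# Hypothetical input: the '#' inside the string literal is not a comment.
sample = "# first\nx = '# not a comment'  # second\n"
print(extract_comments(sample))  # prints ['# first', '# second']
```

Because the tokenizer knows where string literals begin and end, there is no need to strip them out first.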

Note that the code in the question's string has a syntax error: a colon is missing after the do_something() function definition.

Also note that the ast module would not help here, since it does not preserve comments.
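A quick way to see this for yourself, as a small Python 3 sketch with a made-up one-line source string:

```python
import ast

# Hypothetical one-line module with a trailing comment.
source = "x = 1  # the answer\n"

tree = ast.parse(source)
dumped = ast.dump(tree)

# The assignment is in the tree, but the comment text is gone.
print('the answer' in dumped)  # prints False
```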

Upvotes: 1
