Reputation: 481
I am trying to work out a good regular expression for matching Python comments that are located within a long string. So far I have
regex:
#(.?|\n)*
string:
'### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n # this call outputs an xml stream of the current parameter dictionary.\n paramtertools.print_header(params)\n\nfor i in xrange(256): # wow another comment\n print i**2\n\n'
I feel like there is a much better way to get all of the individual comments from the string, but I am not an expert in regular expressions. Does anyone have a better solution?
Upvotes: 3
Views: 154
Reputation: 46841
Get the comments from the matched group at index 1.
(#+[^\\\n]*)
Sample code:
import re
# A plain raw string works on both Python 2 and 3 (the ur'' prefix is Python 2 only)
p = re.compile(r'(#+[^\\\n]*)')
test_str = u"..."
p.findall(test_str)
Matches:
1. ### this is a comment
2. # this call outputs an xml stream of the current parameter dictionary.
3. # wow another comment
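One thing worth checking with this pattern is how it behaves when a # sits inside a string literal. Here is a minimal sketch using the same pattern on a made-up input (the `source` string and the `COMMENT_RE` name are my own, not from the question):

```python
import re

# Same pattern as in the answer, compiled from a plain raw string.
COMMENT_RE = re.compile(r'(#+[^\\\n]*)')

# Hypothetical input: the second line has a '#' inside a string literal.
source = 'x = 1  # set x\ny = "abc#bar"  # set y\n'

matches = COMMENT_RE.findall(source)
print(matches)  # ['# set x', '#bar"  # set y']
```

So the pattern starts matching at the first # on a line, whether or not that # is inside quotes. That is fine for the string in the question, but it may matter for other inputs.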
Upvotes: 1
Reputation:
Regex will work fine if you do two things:
1. Remove all string literals (since they can contain # characters).
2. Capture everything that starts with a # character and proceeds to the end of the line.
Below is a demonstration:
>>> from re import findall, sub
>>> string = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something()\n # this call outputs an xml stream of the current parameter dictionary.\n paramtertools.print_header(params)\n\nfor i in xrange(256): # wow another comment\n print i**2\n\n'
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['### this is a comment', '# this call outputs an xml stream of the current parameter dictionary.', '# wow another comment']
>>>
re.sub removes anything of the form "..." or '...'. This saves you from having to worry about comments that are inside string literals.
(?s) sets the dot-all flag, which allows . to match newline characters.
Lastly, re.findall gets everything that starts with a # character and proceeds to the end of the line.
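The two steps can also be wrapped in a small helper. This is only a sketch: `find_comments`, `STRING_LITERAL`, `COMMENT`, and both sample inputs are names and data I made up for illustration. The patterns are the ones from the demonstration.

```python
import re

# Lazily matches '...' or "..."; (?s) lets . span newlines.
STRING_LITERAL = re.compile('(?s)\'.*?\'|".*?"')
COMMENT = re.compile('#.*')

def find_comments(source):
    """Drop quoted spans first, then collect each '#...' run to end of line."""
    return COMMENT.findall(STRING_LITERAL.sub('', source))

# Hypothetical inputs for illustration.
ok = 'print("abc#bar")  # Comment\n'
tricky = "# don't do this\nx = 1  # it's fine\n"

print(find_comments(ok))      # ['# Comment']
print(find_comments(tricky))  # ['# dons fine']
```

The second input shows a limitation to keep in mind: an apostrophe inside a comment is treated as the start of a string literal, so the text up to the next apostrophe is removed as well. If your comments can contain quote characters, this approach will likely need more work.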
For a more complete test, place this sample code in a file named test.py:
# Comment 1
for i in range(10): # Comment 2
    print('#foo')
    print("abc#bar")
    print("""
#hello
abcde#foo
""") # Comment 3
    print('''#foo
#foo''') # Comment 4
The solution given above still works:
>>> from re import findall, sub
>>> string = open('test.py').read()
>>> findall("#.*", sub('(?s)\'.*?\'|".*?"', '', string))
['# Comment 1', '# Comment 2', '# Comment 3', '# Comment 4']
>>>
Upvotes: 1
Reputation: 473853
Since the string contains Python code, I'd use the tokenize module to parse it and extract the comments:
import tokenize
import StringIO
text = '### this is a comment\na = \'a string\'.toupper()\nprint a\n\na_var_name = " ${an.injection} "\nanother_var = " ${bn.injection} "\ndtabse_conn = " ${cn.injection} "\n\ndef do_something():\n # this call outputs an xml stream of the current parameter dictionary.\n paramtertools.print_header(params)\n\nfor i in xrange(256): # wow another comment\n print i**2\n\n'
tokens = tokenize.generate_tokens(StringIO.StringIO(text).readline)
for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokens:
    if toktype == tokenize.COMMENT:
        print ttext
Prints:
### this is a comment
# this call outputs an xml stream of the current parameter dictionary.
# wow another comment
Note that the code in the string has a syntax error: a missing : after the do_something() function definition.
Also, note that the ast module would not help here, since it doesn't preserve comments.
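For anyone on Python 3, where the StringIO module and the print statement are gone, the same idea can be sketched as follows. `extract_comments` and the `sample` string are hypothetical names and data of mine. The only library calls used are `io.StringIO` and `tokenize.generate_tokens`.

```python
import io
import tokenize

def extract_comments(source):
    """Return the text of every COMMENT token found in `source`."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.COMMENT]

# Hypothetical input: '#' inside a string literal, an apostrophe inside a comment.
sample = (
    "# Comment 1\n"
    "for i in range(10):  # it's Comment 2\n"
    "    print('#foo')\n"
)

print(extract_comments(sample))  # ['# Comment 1', "# it's Comment 2"]
```

Because the tokenizer knows where string literals begin and end, the '#foo' inside the print call is reported as part of a STRING token rather than as a comment.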
Upvotes: 1