Reputation: 1568
I'm writing a Python program that uses a regex to find comments in a C++ program. I wrote the following code:
import re
regex = re.compile(r'(\/\/(.*?))\n|(\/\*(.|\n)*\*\/)')
comments = []
text = ""
while True:
    try:
        x = raw_input()
        text = text + "\n" + x
    except EOFError:
        break
z = regex.finditer(text)
for match in z:
    print match.group(1)
This code should detect comments of the form //I'm comment
and /*blah blah blah
blah blah*/
I'm getting the following output:
// my program in C++
None
//use cout
This is not what I expected. I thought match.group(1) would capture the first parenthesized group of (\/\*(.|\n)*\*\/), but it does not.
The C++ program I'm testing is:
// my program in C++
#include <iostream>
/** I love c++
This is awesome **/
using namespace std;
int main ()
{
cout << "Hello World"; //use cout
return 0;
}
Upvotes: 1
Views: 255
Adding another answer.
(Note: the problem you are having is not related to the order of the
alternatives in the comment sub-expressions.)
Yours is a simplified regex for matching C++ comments. If you don't need
the complete version, we can look at why you're having a problem.
First of all, your regex is almost correct. There is one problem
with the sub-expression for /* ... */ comments: its content must be made non-greedy.
Other than that, it works as it should.
Look more closely at the capture groups, though.
Your code prints only group 1 for each match, which holds the // ... comment.
When the /* ... */ alternative matches instead, group 1 is None and the comment is in group 3.
You can either check groups 1 and 3 for a match, or just print group 0 (the entire match).
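As a rough illustration of the first option, here is a minimal Python 3 sketch (the code in the question is Python 2) that takes each comment from whichever of group 1 or group 3 took part in the match. The names find_comments and CPP_SOURCE are made up for this example, and the pattern is the one from the question with the unneeded backslashes before / dropped.

```python
import re

# Sample program from the question, pasted into a string (illustrative name).
CPP_SOURCE = """// my program in C++
#include <iostream>
/** I love c++
This is awesome **/
using namespace std;
int main ()
{
cout << "Hello World"; //use cout
return 0;
}
"""

# The question's pattern; "/" does not need escaping in Python.
regex = re.compile(r'(//(.*?))\n|(/\*(.|\n)*\*/)')


def find_comments(text):
    """Return each comment, read from group 1 (//) or group 3 (/* */)."""
    found = []
    for match in regex.finditer(text):
        # Only one alternative takes part in a given match;
        # the group belonging to the other one is None.
        if match.group(1) is not None:
            found.append(match.group(1))
        else:
            found.append(match.group(3))
    return found


comments = find_comments(CPP_SOURCE)
for comment in comments:
    print(comment)
```

Printing match.group(0) instead would also work, but for // comments it includes the trailing newline that the pattern consumes.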
Additionally, you don't need the lazy quantifier ? in group 2, and the newline \n following it should NOT be there.
Also, consider making all the capture groups non-capturing, (?: .. ).
So, remove the ? quantifier and the \n in the // ... sub-expression,
and add the ? quantifier in the /* ... */ sub-expression.
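To see why these two changes matter, here is a small Python 3 sketch. The input strings are invented for the demonstration and are not from the question.

```python
import re

# Invented input: two block comments with code between them.
two_blocks = "/* first */ int x = 1; /* second */"

greedy = re.compile(r'/\*(?:.|\n)*\*/')   # original quantifier
lazy = re.compile(r'/\*(?:.|\n)*?\*/')    # with the added ?

# The greedy version runs to the LAST */ and swallows the code in between.
greedy_matches = [m.group(0) for m in greedy.finditer(two_blocks)]
# The lazy version stops at the first */ after each /*.
lazy_matches = [m.group(0) for m in lazy.finditer(two_blocks)]

# Invented input: a // comment on the last line, with no newline after it.
no_newline = "int x; // last line"

# Requiring \n after the comment makes this one impossible to match.
with_newline = [m.group(0) for m in re.finditer(r'//.*?\n', no_newline)]
# Without the \n (and without the lazy ?), .* simply stops at end of line.
without_newline = [m.group(0) for m in re.finditer(r'//.*', no_newline)]

print(greedy_matches)
print(lazy_matches)
print(with_newline)
print(without_newline)
```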
Here is your original regex, formatted (using RegexFormat 5 with auto comments):
# raw regex: (//(.*?))\n|(/\*(.|\n)*\*/)
( # (1 start)
//
( .*? ) # (2)
) # (1 end)
\n
|
( # (3 start)
/\*
( . | \n )* # (4)
\*/
) # (3 end)
Here it is without the capture groups and with the two minor quantifier changes applied:
# raw regex: //(?:.*)|/\*(?:.|\n)*?\*/
//
(?: .* )
|
/\*
(?: . | \n )*?
\*/
Output
** Grp 0 - ( pos 0 , len 21 )
// my program in C++
---------------------------
** Grp 0 - ( pos 43 , len 38 )
/** I love c++
This is awesome **/
---------------------------
** Grp 0 - ( pos 143 , len 10 )
//use cout
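For reference, a small Python 3 sketch of how the revised pattern could be used. Because it has no capture groups, findall returns the whole match for each comment. CPP_SOURCE is just the sample program from the question pasted into a string, and the variable names are illustrative.

```python
import re

# Sample program from the question (illustrative name).
CPP_SOURCE = """// my program in C++
#include <iostream>
/** I love c++
This is awesome **/
using namespace std;
int main ()
{
cout << "Hello World"; //use cout
return 0;
}
"""

# Revised pattern: non-capturing groups, no trailing \n, lazy block comment.
comment_re = re.compile(r'//(?:.*)|/\*(?:.|\n)*?\*/')

# With no capture groups, findall yields the full text of every match.
comments = comment_re.findall(CPP_SOURCE)
for comment in comments:
    print(comment)
    print('-' * 27)
```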
Upvotes: 0
Reputation: 56
Use group(0). In the code below, the file 'txt' contains your example:
import re
regex = re.compile(r'(\/\/(.*?))\n|(\/\*(.|\n)*\*\/)')
comments = []
text = ""
for line in open('txt').readlines():
    text = text + line
z = regex.finditer(text)
for match in z:
    print match.group(0).replace("\n","")
I got output as:
// my program in C++
/** I love c++ This is awesome **/
//use cout
To help you understand: if you wrap the whole pattern in an outer group, group(1) becomes the entire match:
import re
regex = re.compile(r'((\/\/(.*?))\n|(\/\*(.|\n)*\*\/))')
comments = []
text = ""
for line in open('txt').readlines():
    text = text + line
z = regex.finditer(text)
for match in z:
    print match.group(1)
would output:
// my program in C++
/** I love c++
This is awesome **/
//use cout
Upvotes: 0
Reputation: 89557
Your alternatives are not in the best order, since a single-line comment can be included inside a multiline comment. So you need to begin your pattern with the multiline comment. Example:
/\*[\s\S]*?\*/|//.*
Note that you can improve this pattern if you have long multiline comments (this syntax emulates the atomic group feature, which the re module did not support natively before Python 3.11):
/\*(?:(?=([^*]+|\*(?!/))\1)*\*/|//.*
But note too that there are other traps, such as a string that contains /*...*/ or //.....
So if you want to avoid these cases, for example when doing a replacement, you need to match and capture strings first and use a backreference to that group in the replacement string, like this:
(pattern for strings)|/\*[\s\S]*?\*/|//.*
replacement: $1
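In Python the same idea might look like the sketch below. The string sub-pattern is my own assumption (simple double- and single-quoted literals with backslash escapes; raw string literals are not handled), and strip_comments is an illustrative name. Python's re.sub writes the backreference as \1 rather than $1; a replacement function is used here so that group 1 being unset for comment matches is handled explicitly.

```python
import re

# Group 1: string literals (assumed sub-pattern, see the note above).
# The remaining alternatives are the comment patterns from this answer.
pattern = re.compile(
    r'''("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')'''
    r'''|/\*[\s\S]*?\*/|//.*'''
)


def strip_comments(source):
    """Remove comments, leaving string literals untouched."""
    # For a string match, put group 1 back; for a comment, group 1 is
    # None and the match is replaced with nothing.
    return pattern.sub(lambda m: m.group(1) or '', source)


sample = 'cout << "not // a comment"; // real comment\nint y; /* gone */\n'
stripped = strip_comments(sample)
print(stripped)
```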
Upvotes: 1
Unfortunately, you have to parse quoted strings and other non-comment text at the same time, because
partial comment syntax can be embedded within them.
Here is an old Perl regex that does this. On each match, capture group 1 contains a comment
if the match was a comment. So run a while loop with a global search and check whether group 1 matched.
# (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
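The regex carries over to Python's re module unchanged. Below is a minimal Python 3 sketch of the loop described above, using finditer in place of a global while loop. The names CPP_TOKEN, extract_comments and CPP_SOURCE are illustrative, and stripping the trailing newline from // comments is my own choice, since that branch of the pattern consumes it.

```python
import re

# Sample program from the question (illustrative name).
CPP_SOURCE = """// my program in C++
#include <iostream>
/** I love c++
This is awesome **/
using namespace std;
int main ()
{
cout << "Hello World"; //use cout
return 0;
}
"""

CPP_TOKEN = re.compile(
    # Group 1: comments
    r'''(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)'''
    # Group 2: non-comments (strings, character literals, other text)
    r'''|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)'''
)


def extract_comments(source):
    """Collect the text of every match in which group 1 participated."""
    comments = []
    for match in CPP_TOKEN.finditer(source):
        if match.group(1) is not None:
            # The // branch includes the newline that ends the comment.
            comments.append(match.group(1).rstrip("\n"))
    return comments


comments = extract_comments(CPP_SOURCE)
for comment in comments:
    print(comment)
```

Because quoted text is consumed by group 2, a // or /* inside a string literal is not reported as a comment.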
Upvotes: 0