Reputation: 837
I'm trying to find C-style comments in a C file, but I'm having trouble when // appears inside a quoted string. This is the file:
/*My function
is great.*/
int j = 0//hello world
void foo(){
//tricky example
cout << "This // is // not a comment\n";
}
It will match that cout line. This is what I have so far (I can already match the /* */ comments):
fp = open(s)
p = re.compile(r'//(.+)')
txt = p.findall(fp.read())
print (txt)
Upvotes: 2
Views: 2575
Reputation: 89557
The first step is to identify cases where // or /* must not be interpreted as the beginning of a comment, for example when they are inside a string (between quotes). To skip content between quotes (or other things), the trick is to put it in a capture group and to insert a backreference in the replacement pattern:
pattern:
(
"(?:[^"\\]|\\[\s\S])*"
|
'(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/
replacement:
\1
Since quoted parts are searched first, each time you find // or /*...*/, you can be sure that you are not inside a string.
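As a rough illustration of the replacement trick, here is a minimal Python sketch that applies the simple pattern above with re.sub. The names STRIP_COMMENTS and strip_comments are made up for this example, and a replacement function stands in for the \1 replacement string so that comment matches (where group 1 did not participate) are replaced by an empty string.

```python
import re

# Quoted strings come first and are captured in group 1; comments are not captured.
STRIP_COMMENTS = re.compile(r'''
    (
        "(?:[^"\\]|\\[\s\S])*"      # double-quoted string
      |
        '(?:[^'\\]|\\[\s\S])*'      # single-quoted character literal
    )
  |
    //.*                            # single-line comment
  |
    /\*(?:[^*]|\*(?!/))*\*/         # multi-line comment
''', re.VERBOSE)


def strip_comments(source):
    # Put a matched string back unchanged; a matched comment leaves
    # group 1 unset, so it is replaced by nothing.
    return STRIP_COMMENTS.sub(lambda m: m.group(1) or '', source)


src = 'int j = 0;//hello\nputs("a // b"); /* c */\n'
print(strip_comments(src))
```

The // inside the string literal survives because the string alternative consumes the whole literal before the comment alternatives get a chance to match inside it.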
Note that the pattern is deliberately inefficient (due to the (A|B)* subpatterns) to make it easier to understand. To make it more efficient, you can rewrite it like this:
("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/
(?=(something+))\1 is only a way to emulate an atomic group (?>something+).
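To see what the emulation buys, here is a small sketch on a toy pattern (the patterns and the input are invented for illustration). A plain a+ can give characters back so that the rest of the pattern succeeds; the lookahead-plus-backreference form cannot, which is exactly the behaviour of an atomic group.

```python
import re

plain = re.compile(r'a+ab')               # a+ may backtrack and give one "a" back
atomic_like = re.compile(r'(?=(a+))\1ab')  # lookahead result is fixed once matched

# The plain version matches by letting a+ take only "aa".
print(plain.search('aaab'))

# The emulated atomic group always takes every available "a",
# so no "a" is left for the literal "ab" and the search fails.
print(atomic_like.search('aaab'))
```

Python 3.11 and later also accept the native (?>...) syntax; the emulation is only needed on older versions.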
So, if you only want to find comments (and not remove them), the handiest way is to put the comment part of the pattern in a capture group and to test whether it is empty. The following pattern has been adapted (after Jonathan Leffler's comment) to handle the trigraph ??/, which is interpreted as a backslash character by the preprocessor (I assume that the code isn't written for the -trigraphs option), and to handle a backslash followed by a newline character, which allows a single logical line to be split across several lines:
import re

with open(s) as fp:
    text = fp.read()

p = re.compile(r'''(?x)
(?=["'/]) # trick to make it faster, a kind of anchor
(?:
"(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1" # double quotes string
|
'(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2' # single quotes string
|
(
/(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?/|\\)\n)*.* # single line comment
|
/(?:(?:\?\?/|\\)\n)*\* # multiline comment
(?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
\*(?:(?:\?\?/|\\)\n)*/
)
)
''')
for m in p.findall(text):
    if m[2]:  # group 3 holds the comment; it is empty when a string matched
        print(m[2])
These changes do not affect the pattern's efficiency, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the lookahead at the beginning of the pattern, (?=["'/]), which allows internal optimizations to quickly find the first character.
Another optimization is the use of emulated atomic groups, which reduce backtracking to a minimum and allow greedy quantifiers inside repeated groups.
NB: luckily, there is no heredoc syntax in C!
Upvotes: 7
Reputation: 241701
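Python's re.findall method basically works the same way as most lexers do: it successively returns the next match, starting where the previous match finished. (Unlike a longest-match lexer, Python tries the alternatives from left to right and takes the first one that matches, so the order of the patterns matters.) All that is required is to produce a disjunction of all the lexical patterns: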
(<pattern 1>)|(<pattern 2>)|...|(<pattern n>)
Unlike most lexers, it doesn't require the matches to be contiguous, but that's not a significant difference since you can always just add (.) as the last pattern, in order to match all otherwise unmatched characters individually.
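A small sketch of that catch-all idea, using two toy token patterns invented for the example (numbers and words). With (.) as the last alternative and re.DOTALL so that it also matches newlines, the matches cover the input without gaps:

```python
import re

# Toy lexer: numbers, words, and a catch-all for every other character.
lexer = re.compile(r'(\d+)|([A-Za-z]+)|(.)', re.DOTALL)

text = 'ab 12\n'
tokens = lexer.findall(text)
print(tokens)

# Every character lands in exactly one match, so the pieces rebuild the input.
rebuilt = ''.join(''.join(groups) for groups in tokens)
print(rebuilt == text)
```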
An important feature of re.findall is that if the regex has any groups, then only the groups will be returned. Consequently, you can exclude alternatives by simply leaving out the parentheses, or changing them to non-capturing parentheses:
(<pattern 1>)|(?:<unimportant pattern 2>)|(<pattern 3>)
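For instance, with one capturing and one non-capturing alternative (both toy patterns chosen for this sketch), re.findall still produces one entry per match, but the entry is an empty string whenever the non-capturing alternative was the one that matched:

```python
import re

# Numbers are captured; words are matched (and skipped over) but not captured.
p = re.compile(r'(\d+)|(?:[A-Za-z]+)')

found = p.findall('abc 12 de 345')
print(found)                      # includes '' for each word match

numbers = [m for m in found if m]  # drop the placeholders
print(numbers)
```

Filtering out the empty strings is the same step as the `if c` test used when collecting comments.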
With that in mind, let's take a look at how to tokenize C just enough to recognize comments. We need to deal with:
// Comment
/* Comment */
"Might include escapes like \n"
'\t'
Now let's create a regex for each of the above, in the same order:
//[^\n]*
/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/
Note that it uses (?:...) to avoid capturing the repeated group.
"(?:[^"\\]|\\.)*"
'(?:[^'\\]|\\.)*'
Finally, the goal was to find the text of C-style comments. So we just need to avoid captures in any of the other groups. Hence:
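import re

# Only the // alternative is wrapped in a capturing group.
p = re.compile('|'.join((r"(//[^\n]*)"
,r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/"
,'"'+r"""(?:[^"\\]|\\.)*"""+'"'
,r"'(?:[^'\\]|\\.)*'")))

def find_line_comments(text):
    return [c[2:] for c in p.findall(text) if c]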
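As a quick check, here is a self-contained sketch that runs this approach on the file from the question. The constant names are invented for readability, and the // alternative is written as (//[^\n]*) so that the whole comment is captured; the other three alternatives only consume text.

```python
import re

LINE_COMMENT = r"(//[^\n]*)"                          # captured
BLOCK_COMMENT = r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/"  # matched, not captured
DQ_STRING = r'"(?:[^"\\]|\\.)*"'                      # matched, not captured
SQ_STRING = r"'(?:[^'\\]|\\.)*'"                      # matched, not captured

p = re.compile('|'.join((LINE_COMMENT, BLOCK_COMMENT, DQ_STRING, SQ_STRING)))

# The sample file from the question, kept raw so that \n stays two characters.
sample = r'''/*My function
is great.*/
int j = 0//hello world
void foo(){
//tricky example
cout << "This // is // not a comment\n";
}
'''

comments = [c[2:] for c in p.findall(sample) if c]
print(comments)
```

The block comment and the string literal each produce an empty entry that the `if c` filter drops, so the // sequences inside the cout string never show up.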
Above, I left out some obscure cases which are unlikely to arise:
In an #include <...> directive, the <...> is essentially a string. In theory, it could contain quotes or sequences which look like comments, but in practice you will never see:
#include </*This looks like a comment but it is a filename*/>
A line which ends with \ is continued on the next line; the \ and following newline character are simply removed from the input. This happens before any lexical scanning is performed, so the following is a perfectly legal comment (actually two comments):
/\
**************** Surprise! **************\
//////////////////////////////////////////
To make the above worse, the trigraph ??/ is the same as a \, and that replacement happens before the continuation handling.
/************************************//??/
**************** Surprise! ************??/
//////////////////////////////////////////
Outside of obfuscation contests, no-one actually uses trigraphs. But they're still in the standard. The easiest way to deal with both of these issues would be to prescan the string:
def find_line_comments(text):
    return [c[2:]
            for c in p.findall(text.replace('??/', '\\').replace('\\\n', ''))
            if c]
The only way to deal with the #include <...> issue, if you really cared about it, would be to add one more pattern, something like #include\s*<[^>\n]*>.
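Putting the two workarounds together might look like the following sketch. The helper names prescan and find_line_comments are assumptions made for this example, and the extra alternative is written as #include\s*<[^>\n]*>, placed first so that a header name is consumed before anything inside it can be mistaken for a comment.

```python
import re

p = re.compile('|'.join((
    r"#include\s*<[^>\n]*>",               # header name: matched, not captured
    r"(//[^\n]*)",                         # // comment: captured
    r"/[*][^*]*[*]+(?:[^/*][^*]*[*]+)*/",  # /* */ comment: matched, not captured
    r'"(?:[^"\\]|\\.)*"',                  # string literal
    r"'(?:[^'\\]|\\.)*'",                  # character literal
)))


def prescan(text):
    # Replace the ??/ trigraph with a backslash first, then splice
    # lines that end in a backslash, in that order.
    return text.replace('??/', '\\').replace('\\\n', '')


def find_line_comments(text):
    return [c[2:] for c in p.findall(prescan(text)) if c]


source = '#include <a//b.h>\nint x; // real ??/\ncontinued\n'
print(find_line_comments(source))
```

Note that the prescan changes character offsets and line numbers, so this only suits cases where the comment text itself is all that is needed.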
Upvotes: 2