Reputation: 2230
I'm working on a routine to strip block or line comments from some C# code. I have looked at the other examples on the site, but haven't found the exact answer that I'm looking for.
I can match block comments (/* comment */) in their entirety using this regular expression with RegexOptions.Singleline:
(/\*[\w\W]*\*/)
And I can match line comments (// comment) in their entirety using this regular expression with RegexOptions.Multiline:
(//((?!\*/).)*)(?!\*/)[^\r\n]
Note: I'm using [^\r\n] instead of $ because $ is including \r in the match, too.
However, this doesn't quite work the way I want it to.
Here is my test code that I'm matching against:
// remove whole line comments
bool broken = false; // remove partial line comments
if (broken == true)
{
return "BROKEN";
}
/* remove block comments
else
{
return "FIXED";
} // do not remove nested comments */ bool working = !broken;
return "NO COMMENT";
The block expression matches
/* remove block comments
else
{
return "FIXED";
} // do not remove nested comments */
which is fine and good, but the line expression matches
// remove whole line comments
// remove partial line comments
and
// do not remove nested comments
Also, if I do not have the */ negative lookahead in the line expression twice, it matches
// do not remove nested comments *
which I really don't want.
What I want is an expression that will match characters, starting with //, to the end of line, but does not contain */ between the // and end of line.
Also, just to satisfy my curiosity, can anyone explain why I need the lookahead twice? (//((?!\*/).)*)[^\r\n] and (//(.)*)(?!\*/)[^\r\n] will both include the *, but (//((?!\*/).)*)(?!\*/)[^\r\n] and (//((?!\*/).)*(?!\*/))[^\r\n] won't.
Upvotes: 49
Views: 52110
Reputation: 4481
Also see my project for C# code minification: CSharp-Minifier
Besides removing comments, spaces, and line breaks from code, it can currently shorten local variable names and perform other minifications.
Upvotes: 0
Reputation: 1423
For block comments (/* ... */) you can use this expression:
/\*([^\*/])*\*/
It will also work with multiline comments.
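A quick way to check this pattern is to run it against a few sample comments. The sketch below is illustrative only: the class name `BlockCommentCheck`, the sample inputs, and the lazy alternative `/\*[\s\S]*?\*/` are assumptions, not part of the answer. It suggests that the character class `[^\*/]` matches across newlines but also rejects any `*` or `/` inside the comment body, so a lazy pattern may be the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlockCommentCheck
{
    // Pattern from the answer: the comment body may not contain '*' or '/'.
    public const string Strict = @"/\*([^\*/])*\*/";

    // Alternative (an assumption, not from the answer): lazy match up to the first "*/".
    public const string Lazy = @"/\*[\s\S]*?\*/";

    public static bool Matches(string pattern, string text)
    {
        return Regex.IsMatch(text, pattern);
    }

    public static void Main()
    {
        string[] samples =
        {
            "/* plain comment */",
            "/* spans\ntwo lines */",
            "/* see http://example.com */",
            "/* a * b */",
        };

        // Columns: strict result, lazy result, sample (newline shown as \n).
        foreach (var s in samples)
        {
            Console.WriteLine("{0,-5} {1,-5} {2}",
                Matches(Strict, s), Matches(Lazy, s), s.Replace("\n", "\\n"));
        }
    }
}
```

The strict pattern should accept the first two samples and reject the last two, while the lazy pattern accepts all four.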
Upvotes: 0
Reputation: 2325
I found this one at http://gskinner.com/RegExr/ (named ".Net Comments aspx")
(//[\t|\s|\w|\d|\.]*[\r\n|\n])|([\s|\t]*/\*[\t|\s|\w|\W|\d|\.|\r|\n]*\*/)|(\<[!%][ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t%]*)\>)
When I test it, it seems to remove all // comments and /* comments */ as it should, leaving those inside quotes alone.
I haven't tested it a lot, but it seems to work pretty well (even though it's a horrific, monstrous line of regex).
Upvotes: 1
Reputation: 33908
You could tokenize the code with an expression like:
@(?:"[^"]*")+|"(?:[^"\n\\]+|\\.)*"|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/
It would also match some invalid escapes/structures (e.g. 'foo'), but will probably match all valid tokens of interest (unless I forgot something), thus working well for valid code.
Using it in a replace and capturing the parts you want to keep will give you the desired result. For example:
static string StripComments(string code)
{
    var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
    return Regex.Replace(code, re, "$1");
}
Example:
using System;
using System.Text.RegularExpressions;
namespace Regex01
{
    class Program
    {
        static string StripComments(string code)
        {
            var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
            return Regex.Replace(code, re, "$1");
        }
        static void Main(string[] args)
        {
            var input = "hello /* world */ oh \" '\\\" // ha/*i*/\" and // bai";
            Console.WriteLine(input);
            var noComments = StripComments(input);
            Console.WriteLine(noComments);
        }
    }
}
Output:
hello /* world */ oh " '\" // ha/*i*/" and // bai
hello oh " '\" // ha/*i*/" and
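To see why the pattern lists verbatim strings and character literals as separate alternatives, here is a small sketch that runs the same expression over three more inputs. The class name `LiteralCases` and the sample lines are made up for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralCases
{
    // Same pattern as StripComments in the answer above.
    const string Pattern =
        @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";

    public static string StripComments(string code)
    {
        // Literals are captured in group 1 and written back; comments match
        // outside the group, so "$1" is empty for them.
        return Regex.Replace(code, Pattern, "$1");
    }

    public static void Main()
    {
        // Hypothetical sample lines, written as C# verbatim literals.
        string[] samples =
        {
            @"var p = @""C:\tmp\""; // verbatim string ending in a backslash",
            @"var q = @""say """"hi"""" // still inside""; /* gone */",
            @"char c = '""'; // a quote character",
        };

        foreach (var s in samples)
        {
            Console.WriteLine(StripComments(s));
        }
    }
}
```

In each case the literal should survive untouched and only the trailing comment should disappear, leaving the space that preceded it.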
Upvotes: 9
Reputation: 66573
Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.
The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.
So let’s define a regular expression that matches each of those four tokens:
var blockComments = @"/\*(.*?)\*/";
var lineComments = @"//(.*?)\r?\n";
var strings = @"""((\\[^\n]|[^""\n])*)""";
var verbatimStrings = @"@(""[^""]*"")+";
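To make the "whichever starts first wins" behaviour concrete, the following sketch lists the tokens that the combined pattern finds in a single line. The class name `TokenOrderDemo`, the helper `Tokens`, and the sample line are assumptions made up for illustration; the four patterns are the ones defined above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenOrderDemo
{
    // The four patterns from the answer, unchanged.
    const string BlockComments = @"/\*(.*?)\*/";
    const string LineComments = @"//(.*?)\r?\n";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    // Returns the matched tokens in the order the regex engine finds them.
    public static List<string> Tokens(string input)
    {
        var pattern = BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings;
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern, RegexOptions.Singleline))
        {
            result.Add(m.Value);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample: comment markers inside a string, quotes inside comments.
        var sample = "string s = \"a // b\"; /* c \"d\" */ // e \"f\"\n";
        foreach (var token in Tokens(sample))
        {
            Console.WriteLine("[" + token.TrimEnd() + "]");
        }
    }
}
```

The string literal starts first, so the `//` inside it is never seen as a comment; likewise the quotes inside the two comments are never seen as strings.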
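To answer the question in the title (strip comments), we need to replace the block comments with nothing, replace the line comments with a newline (because the line-comment regex consumes the newline), and keep the literal strings where they are.
Regex.Replace can do this easily using a MatchEvaluator function: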
string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);
I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.
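For anyone who wants to try this on the sample from the question, here is a self-contained sketch. The class name `CommentStripper` and the method `Strip` are hypothetical wrappers. It also contains one deliberate change that is not in the code above: the line-comment pattern ends in `(\r?\n|$)` instead of `\r?\n`, because the original requires a newline and would therefore skip a `//` comment on the last line of input that has no trailing newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    const string BlockComments = @"/\*(.*?)\*/";
    // Variant (an assumption, not the answer's pattern): also accept end of input.
    const string LineComments = @"//(.*?)(\r?\n|$)";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    public static string Strip(string input)
    {
        return Regex.Replace(input,
            BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings,
            me =>
            {
                // Block comments vanish; line comments leave a newline behind.
                if (me.Value.StartsWith("/*", StringComparison.Ordinal)) return "";
                if (me.Value.StartsWith("//", StringComparison.Ordinal)) return Environment.NewLine;
                // Keep the literal strings
                return me.Value;
            },
            RegexOptions.Singleline);
    }

    public static void Main()
    {
        // The test code from the question.
        var sample = string.Join("\n", new[]
        {
            "// remove whole line comments",
            "bool broken = false; // remove partial line comments",
            "if (broken == true)",
            "{",
            "return \"BROKEN\";",
            "}",
            "/* remove block comments",
            "else",
            "{",
            "return \"FIXED\";",
            "} // do not remove nested comments */ bool working = !broken;",
            "return \"NO COMMENT\";",
        });
        Console.WriteLine(Strip(sample));
    }
}
```

Note that a line comment at the very end of the input is replaced by a newline as well, so the output may gain a trailing line break.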
Upvotes: 102
Reputation: 972
Before you implement this, you should first create test cases for it.
There are probably more cases out there.
Once you have all of them, then you can create a parsing rule for each of them, or group some of them.
Solving this with regular expressions alone will probably be very hard and error-prone, hard to test, and hard for you and other programmers to maintain.
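One way to keep such test cases together is a small table-driven check. The sketch below is only an illustration: the class name `StripCommentsTests`, the `Run` helper, and the five cases are assumptions (not a complete list), and the function under test is the single-regex approach quoted earlier in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripCommentsTests
{
    // Function under test: the single-regex approach quoted earlier in this thread.
    public static string StripComments(string code)
    {
        var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
        return Regex.Replace(code, re, "$1");
    }

    // Illustrative cases only (assumed): name, input, expected output.
    static readonly (string Name, string Input, string Expected)[] Cases =
    {
        ("line comment", "int a; // x", "int a; "),
        ("block comment", "int /* x */ a;", "int  a;"),
        ("marker inside string", "s = \"// no\";", "s = \"// no\";"),
        ("marker inside verbatim string", "s = @\"/* no */\";", "s = @\"/* no */\";"),
        ("line comment inside block comment", "/* a // b */ c", " c"),
    };

    // Runs every case and returns the number of failures.
    public static int Run()
    {
        int failures = 0;
        foreach (var c in Cases)
        {
            var actual = StripComments(c.Input);
            if (actual != c.Expected)
            {
                failures++;
                Console.WriteLine("FAIL " + c.Name + ": got [" + actual + "]");
            }
        }
        return failures;
    }

    public static void Main()
    {
        Console.WriteLine(Run() == 0 ? "all cases pass" : "some cases failed");
    }
}
```

New cases are added as one line each, which makes it cheap to record every input that breaks the expression before changing it.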
Upvotes: 8