shoosh

Reputation: 78914

How to strip C++ style single line comments (`// ...`)

For a small DSL I'm writing, I'm looking for a regex to match a comment at the end of a line, like the // syntax of C++. The simple case:

someVariable = 12345; // assignment

is trivial to match, but the problem starts when I have a string on the same line:

someFunctionCall("Hello // world"); // call with a string

The // inside the string should not be matched as a comment.


EDIT - The thing that compiles the DSL is not mine. It's a black box as far as I'm concerned, which I don't want to change, and it doesn't support comments. I just want to add a thin wrapper to make it support comments.

Upvotes: 1

Views: 2265

Answers (2)

Bart Kiers

Reputation: 170148

shoosh wrote:

EDIT - The thing that compiles the DSL is not mine. It's a black box as far as I'm concerned, which I don't want to change, and it doesn't support comments. I just want to add a thin wrapper to make it support comments.

In that case, create a very simple lexer that matches one of three tokens:

  1. // ... comments
  2. string literals: " ... "
  3. or, if none of the above matches, match any single character

Now, while you iterate over these 3 types of tokens, simply print tokens (2) and (3) to stdout (or to a file) to get the uncommented version of your source file.

A demo with GNU Flex:

example input file, in.txt:

someVariable = 12345; // assignment
// only a comment
someFunctionCall("Hello // world"); // call with a string
someOtherFunctionCall("Hello // \" world"); // call with a string and 
                                            // an escaped quote

The lexer grammar file, demo.l:

%%
"//"[^\r\n]*               { /* skip comments */ }
"\""([^"\\\r\n]|\\.)*"\""  { printf("%s", yytext); }
.                          { printf("%s", yytext); }
%%
int main(int argc, char **argv)
{
    while(yylex() != 0);
    return 0;
}

And to run the demo, do:

flex demo.l 
cc lex.yy.c -lfl
./a.out < in.txt

which will print the following to the console:

someVariable = 12345; 

someFunctionCall("Hello // world"); 
someOtherFunctionCall("Hello // \" world"); 
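
If Flex isn't available, the same three-token idea can be hand-written in plain C. What follows is only a minimal sketch, not part of the demo above: the function name strip_line_comments is made up for illustration, and it assumes that string literals use double quotes with backslash escapes and never span more than one line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (name and signature are assumptions, not from the
 * Flex demo): copies src to dst while dropping "// ..." comments.
 * dst must have room for at least strlen(src) + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL.
 */
size_t strip_line_comments(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t n = 0;
    const char *p = src;

    while (*p != '\0') {
        if (p[0] == '/' && p[1] == '/') {
            /* Token 1: comment. Skip up to, but not including, the line end. */
            while (*p != '\0' && *p != '\n' && *p != '\r')
                p++;
        } else if (*p == '"') {
            /* Token 2: string literal. Copy it verbatim, honoring escapes. */
            dst[n++] = *p++;
            while (*p != '\0' && *p != '"' && *p != '\n') {
                if (*p == '\\' && p[1] != '\0')
                    dst[n++] = *p++;   /* copy the backslash ... */
                dst[n++] = *p++;       /* ... then the escaped or plain char */
            }
            if (*p == '"')
                dst[n++] = *p++;       /* closing quote, if present */
        } else {
            /* Token 3: any other single character. */
            dst[n++] = *p++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

To use it, read the DSL source into a buffer, call strip_line_comments(buffer, out) with an out buffer of the same size, and hand out to the black-box compiler. As in the Flex version, line endings are kept, so line numbers in the compiler's error messages still match the original file.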

EDIT

I'm not really familiar with C/C++, and just saw @sehe's recommendation of using a pre-processor. That looks to be a far better option than creating your own (small) lexer. But I think I'll leave this answer since it shows how to handle this kind of stuff if no pre-processor is available (for whatever reason: perhaps cpp doesn't recognise certain parts of the DSL?).

Upvotes: 2

sehe

Reputation: 392931

EDIT

Since you are effectively preprocessing a source file, why don't you use an existing preprocessor? If the language is sufficiently similar to C/C++ (especially regarding quoting and string literals), you will be able to just use cpp -P:

 echo 'int main() { char* sz="Hello//world"; /*profit*/ } // comment' | cpp -P

Output: int main() { char* sz="Hello//world"; }


Other ideas:

Use a proper lexer/parser instead

Have a look at

  • Coco/R (available for Java, C++, C#, etc.)
  • ANTLR (likewise)
  • Boost Spirit (with Spirit Lex to make it even easier to strip the comments)

All of these suites come with sample grammars that parse C, C++ or a subset thereof.

Upvotes: 2
