Reputation: 4373

Why is commenting out multiline comments in c++ inconsistent?

So we know that

// This doesn't affect anything

/*
This doesn't affect anything either
*/

/*
/* /* /*
This doesn't affect anything
*/
This does because comments aren't recursive

/* /*
This doesn't affect anything
*/ */
This throws an error because the second * / is unmatched since comments aren't recursive

I've heard that they aren't recursive because nesting would slow down the compiler, and I guess that makes sense. However, nowadays when I parse C++ code in a higher-level language (say Python), I can simply use the regular expression

"\/[\/]+((?![\n])[\s\S])*\r*\n"

to match // single line comments, and use

"\/\*((?!\*\/)[\s\S])*\*\/"

to match /* multiline comments */, then loop through all single line comments, remove them, then loop through all multi-line comments and remove them. Or vice versa. But that's where I'm stuck. It seems that doing one or the other isn't sufficient, because:

// /*
An error is thrown because the /* is ignored
*/

/*
This doesn't affect things because of mysterious reasons
// */

and

/*
This throws an error because the second * / is unmatched
// */ */

What is the reason for this behavior? Is it also an artifact of the way compilers parse things? To be clear, I don't want to change the behavior of C++; I would just like to know the reasoning behind the second set of examples behaving the way they do.

Edit:

So yes, to be more explicit, my question is why the following three (seemingly reasonable) ways of explaining this behavior don't work:

1. Simply ignore all characters on a line after //, regardless of whether they are /* or */, even if you are in a multiline comment.
2. Allow a /* or */ followed by a // to still have effect.
3. Both of the above.

I understand why nested comments aren't allowed, because they would require a stack and arbitrarily high amounts of memory. But these three cases would not.

Edit again:

If anyone is interested, here is Python code that extracts the comments from a C/C++ file following the commenting rules discussed here:

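import re

# Scan left to right; whichever comment opener is met first wins.
# Note: string literals and backslash line continuations are not handled.
commentScanner = re.Scanner([
  (r"//+[^\n]*\n?", lambda scanner, token: ("//", token)),
  (r"/\*(?:(?!\*/)[\s\S])*\*/", lambda scanner, token: ("/* ... */", token)),
  (r"[\s\S]", None)  # any other character: skip it
])
# scan() returns (list of results, unmatched remainder of the input)
comments, remainder = commentScanner.scan("fds a45fsa//kjl fds4325lkjfa/*jfds/\nk\\lj\\/*4532jlfds5342a  l/*a/*b/*c\n//fdsafa\n\r\n/*jfd//a*/fd// fs54fdsa3\r\r//\r/*\r\n2a\n\n\nois")
print(comments)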

Upvotes: -1

Answers (3)

user557597

Everything inside a comment is just text, but when you remove the comment delimiter,
the exposed text becomes available to be parsed again.
So if part of that text contains comment delimiters, they can be parsed as new comment delimiters.

And it's always first come, first served, i.e. left-to-right order.

It might be a little simplistic to think that parsing comments is simple.
The fact is that quotes (both single and double) must be parsed at the very same time, and whichever is encountered first, comment or quote, is served first.

Finally, everything inside a comment being skipped means that if you remove the outer
comment layer, whatever remains that is not a valid comment will be parsed as
part of the language. That means there is no certainty about any exposed comment text,
and the chance of getting a parse error is high, if not inevitable.

I believe C++ also allows line continuation in // style comments: a backslash at the very end of a line continues the comment onto the next line.
For example:

// single line continuation\
continuation               \  
end here 
code
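
A rough Python sketch of why the example above is one single comment, assuming the usual model in which each backslash-newline pair is deleted before comments are recognized. The helper name splice_lines and the allowance for CRLF line endings are assumptions made for illustration, not something taken from a real compiler:

```python
import re

def splice_lines(text):
    # Delete every backslash that is immediately followed by a newline,
    # joining the physical lines into one logical line.
    # (Accepting an optional \r is an assumption, to cope with CRLF files.)
    return re.sub(r"\\\r?\n", "", text)

SOURCE = "// single line continuation\\\ncontinuation \\\nend here\ncode\n"

spliced = splice_lines(SOURCE)
# Only after splicing is the // comment located; it now covers all three lines.
without_comments = re.sub(r"//[^\n]*", " ", spliced)

print(repr(spliced))
print(repr(without_comments))
```

Only the last line, code, survives once the comment is removed.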

So the formula for parsing C++ comments with regular expressions is that you have
to parse (match) every single character in the file.
If you search only for the comments, a match can start in
the wrong place, for example inside a quoted string.
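
A minimal Python sketch of that point. The sample input and the names TOKEN and strip_comments are made up for illustration, and the pattern deliberately ignores line continuation, raw string literals and digit separators. Searching only for comments is fooled by the // inside a string literal, while one left-to-right pass that also matches literals is not:

```python
import re

# Hypothetical input: the string literal contains "//", which is not a comment.
SOURCE = 'const char* url = "http://example.com"; // real comment\nint x = 1;\n'

# Naive approach: look for comments only. It cuts the string literal in half.
naive = re.sub(r"//[^\n]*", "", SOURCE)

# Single pass: at each position a quoted literal is tried before a comment,
# so comment delimiters inside literals are consumed as part of the literal.
TOKEN = re.compile(
    r'''
    ("(?:\\.|[^"\\])*"        # group 1: double quoted literal
    |'(?:\\.|[^'\\])*')       #          or single quoted literal
    |//[^\n]*                 # line comment (no continuation handling)
    |/\*.*?\*/                # block comment, not nesting
    ''',
    re.VERBOSE | re.DOTALL,
)

def strip_comments(text):
    # Keep literals unchanged; replace each comment with a single space.
    return TOKEN.sub(lambda m: m.group(1) or " ", text)

print(repr(naive))
print(repr(strip_comments(SOURCE)))
```

Characters that match none of the alternatives are simply left alone by sub, which has the same effect as the "any other char" branch in the regex below.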

A good regex for parsing comments is below. I originally got it from a Perl group
and slightly modified it to handle single-line comments and line continuation.
With it you can remove comments or just find them.

Raw regex:

   # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


   (                                # (1 start), Comments 
        /\*                              # Start /* .. */ comment
        [^*]* \*+
        (?: [^/*] [^*]* \*+ )*
        /                                # End /* .. */ comment
     |  
        //                               # Start // comment
        (?: [^\\] | \\ \n? )*?           # Possible line-continuation
        \n                               # End // comment
   )                                # (1 end)
|  
   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  [\S\s]                           # Any other char
        [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)

Enhanced (preserve formatting), mostly used to delete comments.
Use multi-line mode:

   # ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)

   (                                # (1 start), Comments 
        (?:
             (?: ^ [ \t]* )?                  # <- To preserve formatting
             (?:
                  /\*                              # Start /* .. */ comment
                  [^*]* \*+
                  (?: [^/*] [^*]* \*+ )*
                  /                                # End /* .. */ comment
                  (?:                              # <- To preserve formatting 
                       [ \t]* \r? \n                                      
                       (?=
                            [ \t]*                  
                            (?: \r? \n | /\* | // )
                       )
                  )?
               |  
                  //                               # Start // comment
                  (?:                              # Possible line-continuation
                       [^\\] 
                    |  \\ 
                       (?: \r? \n )?
                  )*?
                  (?:                              # End // comment
                       \r? \n                               
                       (?=                              # <- To preserve formatting
                            [ \t]*                          
                            (?: \r? \n | /\* | // )
                       )
                    |  (?= \r? \n )
                  )
             )
        )+                               # Grab multiple comment blocks if need be
   )                                # (1 end)

|                                 ## OR

   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  (?: \r? \n | [\S\s] )            # Linebreak or Any other char
        [^/"'\\\s]*                      # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)

Upvotes: 1

Yakk - Adam Nevraumont

Reputation: 275385

When a comment starts, everything until the end of the comment is treated as a comment.

So zero // one */ two all by itself could have zero // one */ be the end of a /* */ comment from a previous line, with two outside the comment, or it could be a new single-line comment that starts // one */ two, with zero outside the comment.

As a theory why this was done, // is not a valid C token or token sequence. So there are no programs with // outside of a comment or string in C.

However, // within a comment would be legal. So a header file containing:

/* this is a C style comment
// with some cool
// slashes */

would break if we made // comment out the trailing */. Within a /* */ comment, // is ignored. Compatibility with C was not to be broken for no good reason.

And within a //, everything gets ignored until the end of the line. No sneaky /* or */ allowed.

The parsing rule is really easy -- start a comment, slurp and discard until you see the end token (either a newline, or a */ depending), then continue parsing.
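
That rule can be sketched in a few lines of Python. This is only an illustration, not how any real compiler is written: the function name find_comments is hypothetical, string literals and line continuations are ignored, and an unterminated /* simply swallows the rest of the input instead of being reported as an error.

```python
def find_comments(text):
    """Return the comments found in text, scanning strictly left to right."""
    comments = []
    i = 0
    while i < len(text):
        if text.startswith("//", i):
            # Line comment: slurp up to (not including) the next newline.
            end = text.find("\n", i)
            end = len(text) if end == -1 else end
        elif text.startswith("/*", i):
            # Block comment: slurp up to and including the first */.
            close = text.find("*/", i + 2)
            end = len(text) if close == -1 else close + 2
        else:
            i += 1  # ordinary character, not a comment start
            continue
        comments.append(text[i:end])
        i = end  # resume scanning right after the comment
    return comments

# The three cases from the question:
print(find_comments("// /*\nAn error\n*/\n"))        # the /* is inside the // comment
print(find_comments("/*\nfine\n// */\nint x;\n"))    # the // is inside the /* comment
print(find_comments("/*\ntext\n// */ */\n"))         # the second */ is left over as code
```

Whichever opener is met first decides what kind of comment it is, and nothing inside the comment is examined except the search for its own end token.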

As C++ is not designed to be parsed by regular expressions, your difficulty parsing it with regular expressions was either not considered, or not considered important.

Upvotes: 0

Brian Bi

Reputation: 119144

It's not inconsistent. The existing behaviour is both easy to specify and easy to implement, and your compiler is implementing it correctly. See [lex.comment] in the standard.

The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates with the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]

As you can see, // can be used to comment out both /* and */. It's just that comments don't nest, so if the // is already inside a /*, then the // has no effect at all.

Upvotes: 6
