Best solution of matching C-style multiple line comments in flex?

Question

I have collected several solutions for matching C-style multi-line comments in flex:

(1) (I forgot the reference)

"/*"                    { BEGIN COMMENT; }
<COMMENT>"*/"           { BEGIN INITIAL; }
<COMMENT>([^*]|\n)+|.   { /* skip everything */ }
<COMMENT><<EOF>>        {
                            fatal_error("unterminated comment!");
                            return 0;
                        }

(2) https://www.cs.virginia.edu/~cr4bd/flex-manual/How-can-I-match-C_002dstyle-comments_003f.html#How-can-I-match-C_002dstyle-comments_003f

<INITIAL>{
"/*"              BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   // eat comment in chunks
"*"       // eat the lone star
\n        yylineno++;
}

(3) discards C comments in https://www.cs.virginia.edu/~cr4bd/flex-manual/Start-Conditions.html#Start-Conditions

%x comment
%%
        int line_num = 1;

"/*"         BEGIN(comment);

<comment>[^*\n]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
<comment>\n             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);

(4) difficulty getting c-style comments in flex/lex

"/*"((("*"[^/])?)|[^*])*"*/"

(5) https://stackoverflow.com/a/13368522/4438921

"/*"((\*+[^/*])|([^*]))*\**"*/"

(6) This is actually a Java regex string for matching C-style multi-line comments; I'm not sure whether it can be rewritten for flex: https://stackoverflow.com/a/36328890/4438921

String pat = "/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/";

Which one is actually the best?

rici · Accepted Answer

None of the patterns given is actually correct for C or C++, because they don't take into consideration line splicing or trigraphs. (You might consider trigraphs unnecessary these days, and I wouldn't disagree, but even though they are now deprecated, you might still need to process legacy files which used them.)

(This might not be a consideration for a language which is neither C nor C++, but which has similar multiline comments. In that case, it's a toss-up between the monolithic regular expression and the start condition, but I would choose the start condition to avoid slow-down from very long comments.)

While you can write a monolithic regex which includes splices, you'll find it much easier to write (and read) if you use the start-condition based solution. Of the two extracted from the flex manual, I think (3) is slightly more performant, although in both cases my inclination would be to let flex do the line number counting instead of trying to do it explicitly. Even with %option yylineno, matching the comment one line at a time is probably a good idea, since comments can be quite long and flex is optimised for tokens which do not exceed about 8k.

To handle line splices, you would modify it to:

%option yylineno
%x COMMENT
splice                  (\\[[:blank:]]*\n)*
%%
[/]{splice}[*]          BEGIN(COMMENT);

<COMMENT>{
  [^*\\\n]+             /* eat anything that's not a '*', backslash or line end */
  "*"+[^*/\\\n]*        /* eat up '*'s not followed by '/'s, backslash or line end */
  [*]+{splice}[/]       BEGIN(INITIAL);
  [*\\]                 /* stray '*' or backslash */
  \n                    /* Reduce the amount of work needed for yylineno */
}
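To check what the splice-aware rules are meant to accept without running flex, the same logic can be written by hand. The following is only a sketch in plain C, not generated scanner code: the names skip_splices and skip_block_comment are made up for illustration, and it assumes the whole input is a NUL-terminated string in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Skip zero or more splices (backslash, optional blanks, newline).
 * Counts the newlines it consumes in *seen. Hypothetical helper. */
static const char *skip_splices(const char *p, int *seen)
{
    for (;;) {
        if (*p != '\\')
            return p;
        const char *q = p + 1;
        while (*q == ' ' || *q == '\t')
            q++;
        if (*q != '\n')
            return p;            /* stray backslash, not a splice */
        ++*seen;
        p = q + 1;
    }
}

/* If p starts a block comment, return a pointer just past its end and
 * add the number of newlines inside it to *line. Return NULL if p does
 * not start a comment or the comment is unterminated; *line is then
 * left unchanged. */
const char *skip_block_comment(const char *p, int *line)
{
    assert(p != NULL && line != NULL);
    int seen = 0;
    if (*p != '/')
        return NULL;
    p = skip_splices(p + 1, &seen);
    if (*p != '*')
        return NULL;
    p++;
    while (*p != '\0') {
        if (*p == '*') {
            /* A '*' ends the comment only if a '/' follows,
             * possibly after splices. */
            p = skip_splices(p + 1, &seen);
            if (*p == '/') {
                *line += seen;
                return p + 1;
            }
        } else {
            if (*p == '\n')
                seen++;
            p++;
        }
    }
    return NULL;                 /* unterminated comment */
}
```

A caller would try skip_block_comment at each '/' it meets and fall back to ordinary tokenising when it returns NULL; reporting an unterminated comment separately from "not a comment" would need a richer return value.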

If you want to handle trigraphs, you'll need to expand the definition of splice and add some more rules for ?.

A line splice is a backslash at the end of a line, indicating that the next line is a continuation. The backslash and the newline are removed from the input text, so that the last character of the continued line is followed immediately by the first character of the continuation line. Thus, the following is a valid comment:

/\
************** START HERE **************\
/

Gcc and clang (and quite possibly other compilers) allow the backslash character to be followed by whitespace, since otherwise the difference between a valid continuation and a stray backslash is not visible.
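The effect of splicing can be illustrated with a small preprocessing pass. This is a sketch only: remove_splices is a hypothetical name, it works in place on a NUL-terminated buffer, and it accepts blanks (space or tab) between the backslash and the newline, which is the gcc/clang behaviour described above rather than something the standard requires.

```c
#include <assert.h>
#include <stddef.h>

/* Remove line splices from a NUL-terminated buffer, in place.
 * A splice here is: backslash, optional spaces/tabs, newline.
 * Returns the length of the resulting string. */
size_t remove_splices(char *buf)
{
    assert(buf != NULL);
    size_t in = 0, out = 0;
    while (buf[in] != '\0') {
        if (buf[in] == '\\') {
            size_t k = in + 1;
            while (buf[k] == ' ' || buf[k] == '\t')
                k++;
            if (buf[k] == '\n') {
                in = k + 1;      /* drop backslash, blanks and newline */
                continue;
            }
        }
        buf[out++] = buf[in++];  /* ordinary character, or stray backslash */
    }
    buf[out] = '\0';
    return out;
}
```

Applied to the three-line comment shown earlier, the pass joins the lines so that the leading '/' is immediately followed by the first '*' and the last '*' by the closing '/'. A real scanner would still need to remember the removed newlines if it wants correct line numbers, which is why the flex rules handle splices inside the patterns instead.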

Continuation lines are handled before almost any other processing, so that they can be placed inside string literals, comments, or any token. They're mostly used in #define preprocessor directives to comply with the requirement that a preprocessor directive is a single input line. But someone intent on obfuscating C code could use them more liberally. They can, for example, be used to extend C++-style single line comments over multiple physical lines:

// This is a comment...\
   which extends over...\
   three lines.

The only processing which happens before line continuations is trigraph processing. You can search for trigraphs on Wikipedia (or elsewhere); I'll limit myself to noting that the backslash is one of the characters which has a trigraph equivalent, ??/. Since trigraphs are processed before continuation lines, the first example of a spliced multiline comment could have been written:

/??/
************** START HERE **************\
/

Some compilers do not handle trigraphs by default; they may issue a warning if a trigraph is seen. If you want to try the above with gcc, for example, you'll need to either specify an ISO C standard (e.g. -std=c11) or provide the -trigraphs command-line flag.
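Because trigraph replacement is a simple left-to-right substitution, it can be sketched as a pass that runs before any splice handling. The function names below are hypothetical; the table is the standard set of nine trigraphs, of which ??/ (backslash) is the one that matters for comments.

```c
#include <assert.h>
#include <stddef.h>

/* Given the third character of a candidate trigraph, return the
 * character it stands for, or 0 if it is not a trigraph. */
static char trigraph_char(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Replace trigraphs in a NUL-terminated buffer, in place.
 * Returns the length of the resulting string. */
size_t replace_trigraphs(char *buf)
{
    assert(buf != NULL);
    size_t in = 0, out = 0;
    while (buf[in] != '\0') {
        if (buf[in] == '?' && buf[in + 1] == '?') {
            char r = trigraph_char(buf[in + 2]);
            if (r != 0) {
                buf[out++] = r;  /* three input characters become one */
                in += 3;
                continue;
            }
        }
        buf[out++] = buf[in++];
    }
    buf[out] = '\0';
    return out;
}
```

Running this pass and then removing splices turns the trigraph version of the comment into the same text as the backslash version. When writing test strings for it in C, it is safer to split two adjacent question marks across adjacent string literals, so that a compiler with trigraphs enabled does not translate the literal itself.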
