Adrian Z.

Reputation: 934

Figuring out Flex (lexer) yy_push_state

What would the regex equivalent of the following Flex structure be? I'm trying to recreate Rust's grammar for a project, but right now I'm stuck on this piece. This is the grammar for an inner/outer documentation comment (Rust has six types of comments). It should match comments like /** */ and /*! */, but, for example, I don't understand why [^*] is needed on the first line and what the order of matching is in this case.

\/\*(\*|\!)[^*]       { yy_push_state(INITIAL); yy_push_state(doc_block); yymore(); }
<doc_block>\/\*       { yy_push_state(doc_block); yymore(); }
<doc_block>\*\/       {
    yy_pop_state();
    if (yy_top_state() == doc_block) {
        yymore();
    } else {
        return ((yytext[2] == '!') ? INNER_DOC_COMMENT : OUTER_DOC_COMMENT);
    }
}
<doc_block>(.|\n)     { yymore(); }

As far as I understand: line 1, matches the start /** or /*!; line 2, matches a block comment (for some reason?); line 3, matches the end */; line 11, matches any character or a newline (why?).

Two lines further it also matches for the normal block comment. Why is it also matching for it inside the doc comment?

\/\*                  { yy_push_state(blockcomment); }
<blockcomment>\/\*    { yy_push_state(blockcomment); }
<blockcomment>\*\/    { yy_pop_state(); }
<blockcomment>(.|\n)   { }

Upvotes: 0

Views: 1106

Answers (1)

rici

Reputation: 241771

The flex state stack allows lexical analysis of strings which cannot be described by a regular expression, so there is no regular expression equivalent to that flex specification. For documentation of the state stack, including the syntax for writing state-contingent rules, see the flex manual.

Rust is infamously badly documented, and the comment syntax(es) fall into that category. The Rust book mentions block comments in the syntax index but fails to document the precise syntax in the referenced comments section. I couldn't find any precise description of the syntax understood by rustdoc, either.

I've reverse-engineered the syntax from the flex excerpt you cite, but take it with a grain of salt; it may have only a passing resemblance to the actual syntax accepted by rustc and rustdoc:

  1. Rust block comments, unlike C or C++ block comments, can be nested. That makes them parenthetic syntaxes, which are not regular; they require a pushdown automaton to parse. So no regular expression can describe Rust block comments, and it is necessary to resort to a flex state stack to recognize them.

  2. Rust documentation block comments must start with a slash and precisely two stars (or a star and an exclamation point). A documentation box:

    /*************************************
     *        START OF SECTION           *
     *************************************/
    

    is not considered a documentation comment.

    (I suspect that not recognizing inner block comments starting with /*!* was an oversight, but who knows.)
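
To make the two points above concrete, here is a minimal sketch in C of a hand-written scanner that follows the same reading of the flex excerpt. It is not the code rustc uses. The function name `scan_block_comment` and the `comment_kind` enum are hypothetical, and a plain depth counter stands in for the flex state stack, which works here only because every pushed state is the same.

```c
#include <stddef.h>

// Hypothetical token kinds, named after the tokens in the flex excerpt.
enum comment_kind {
    NOT_A_COMMENT,
    BLOCK_COMMENT,
    OUTER_DOC_COMMENT,
    INNER_DOC_COMMENT
};

// Scans one (possibly nested) block comment at the start of s.
// On success, stores the comment's length in *len and returns its kind.
// Returns NOT_A_COMMENT if s has no opener or the comment never closes.
enum comment_kind scan_block_comment(const char *s, size_t *len)
{
    if (s[0] != '/' || s[1] != '*')
        return NOT_A_COMMENT;

    enum comment_kind kind = BLOCK_COMMENT;
    size_t i = 2;  // a plain opener consumes two characters

    // Same test as the first flex rule: slash, star, then '*' or '!',
    // then one character that is not a star.
    if ((s[2] == '*' || s[2] == '!') && s[3] != '\0' && s[3] != '*') {
        kind = (s[2] == '!') ? INNER_DOC_COMMENT : OUTER_DOC_COMMENT;
        i = 4;     // the doc opener consumes four characters, as in flex
    }

    int depth = 1; // assumption: a counter replaces the flex state stack
    while (s[i] != '\0') {
        if (s[i] == '/' && s[i + 1] == '*') {
            depth++;           // nested opener
            i += 2;
        } else if (s[i] == '*' && s[i + 1] == '/') {
            i += 2;
            if (--depth == 0) { // outermost closer reached
                *len = i;
                return kind;
            }
        } else {
            i++;               // any other character, including newline
        }
    }
    return NOT_A_COMMENT;      // input ended inside the comment
}
```

Called on `/* a /* b */ c */ rest`, the sketch consumes the whole nested comment and stops before ` rest`, which a single regular expression cannot do for arbitrary nesting depth. A box comment such as `/*** box ***/` comes back as a plain block comment because its fourth character is a star.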

If the above is correct, it is possible to answer your questions:

  1. "I don't understand why [^*] is needed on the first line"

    This is to avoid matching box comments, as noted above.

  2. "what the order of matching is in this case."

    In all cases, flex selects the longest possible match at any point in the input, and if more than one rule matches the same longest string, it selects the first rule in the file. This is the so-called "maximal munch" rule. So given the two rules (which I wrote without the forest of leaning timber because I find it unreadable):

    "/*"[*!][^*]     {  DocComment(); }
    "/*"             {  BlockComment(); }
    

    the second rule will apply to the inputs /* Comment and /****, matching two characters, whereas the first rule will apply to /** Documentation comment, matching four characters. (It will also incorrectly apply to /**/, which IMHO should be analyzed as an empty block comment rather than the start of a documentation comment.)

  3. " line 11, matches any character or a newline (why?)"

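    Yes, it does. Every character inside the comment has to be matched by some rule. If this rule were missing, ordinary characters would fall through to flex's default rule, which copies unmatched text to the output, and that would be incorrect. The yymore() call makes flex append the next match to yytext instead of replacing it, so the whole comment accumulates in yytext; that is why the closing rule can still look at yytext[2].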

  4. "Two lines further it also matches for the normal block comment. Why is it also matching for it inside the doc comment?"

    Because the match inside the doc comment only applies inside doc comments. A block comment not inside a doc comment also needs to be matched. However, it is certainly the case that some refactoring is possible here, which could simplify the lexical description.
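
The rule selection described in answer 2 can be sketched in plain C. This is an illustration of the longest-match, first-rule-wins policy, not how flex is implemented (flex compiles all rules into one automaton). The names `select_rule`, `match_doc_start` and `match_block_start` are hypothetical, and the two matchers hard-code the two patterns shown there.

```c
#include <stddef.h>

// Hypothetical rule ids, numbered in the order the rules appear in the file.
enum rule { NO_MATCH = -1, DOC_COMMENT_START = 0, BLOCK_COMMENT_START = 1 };

// Length matched at the start of s by the doc-comment opener pattern
// (slash, star, then '*' or '!', then one non-star character), or 0.
static size_t match_doc_start(const char *s)
{
    if (s[0] == '/' && s[1] == '*' && (s[2] == '*' || s[2] == '!')
        && s[3] != '\0' && s[3] != '*')
        return 4;
    return 0;
}

// Length matched at the start of s by the plain opener (slash, star), or 0.
static size_t match_block_start(const char *s)
{
    return (s[0] == '/' && s[1] == '*') ? 2 : 0;
}

// Picks the rule with the longest match at the start of s.
// On a tie, the rule listed first wins.
enum rule select_rule(const char *s, size_t *len)
{
    size_t lens[2] = { match_doc_start(s), match_block_start(s) };
    enum rule best = NO_MATCH;

    *len = 0;
    for (int r = 0; r < 2; r++) {
        // Strictly greater: a later rule never displaces an equal match.
        if (lens[r] > *len) {
            *len = lens[r];
            best = (enum rule)r;
        }
    }
    return best;
}
```

With these two patterns the lengths are always 4 and 2, so the tie-break never fires; it is there only to show where rule order would matter. The sketch selects the block rule for `/* Comment` and `/****`, and the doc rule for `/** Documentation comment` and also for `/**/`, which is the questionable case mentioned above.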

Upvotes: 1
