How can Verilog's variable_lvalue be written for Bison?

Question

I'm working on a Verilog parser, using Bison to make the parser from the language's formal rules.

This formal syntax specification in BNF comes from the IEEE Standard 1364-2001 “IEEE Standard Verilog Hardware Description Language Reference Manual (LRM).

variable_lvalue ::=
    hierarchical_variable_identifier
  | hierarchical_variable_identifier [ expression ] { [ expression ] }
  | hierarchical_variable_identifier [ expression ] { [ expression ] } [ range_expression ]
  | hierarchical_variable_identifier [ range_expression ]
  | variable_concatenation

Leaving out hierarchical identifiers, breaking the reduction of a single expression into a range_expression into its own rule for clarity and using left recursion (range_expression might or might not be at the end of the list) this is what I have.

nonempty_expression_in_brackets_list
    : OPENBRACKETS range_expression CLOSEBRACKETS { }
    | OPENBRACKETS expression CLOSEBRACKETS { }
    | OPENBRACKETS expression CLOSEBRACKETS nonempty_expression_in_brackets_list { }
    ;

range_expression is defined like this.

range_expression ::= 
    expression
  | msb_constant_expression: lsb_constant_expression
  | base_expression +: width_constant_expression
  | base_expression -: width_constant_expression

Leaving out the rule that reduces a single expression to a range_expression, this is what I have.

range_expression
  : constant_expression COLON constant_expression { }
  | expression PLUS COLON constant_expression     { }
  | expression MINUS COLON constant_expression    { }
  ;

Even in the simplest case where only one expression is used there is a problem with functions. Specifically, the parser can't decide whether to reduce an identifier after a parenthesis to a constant_function_argument or to a function_argument (I expected either constant_function_call or function_call to be in a rule, but not both). For example something like this in the parsed text would be problematic.

IDENTIFIER [ IDENTIFIER ( IDENTIFIER ) ]

Is the third identifier reduced to a function_argument, expecting a later reduction to function_call and finally expression or is it reduced to a constant_function_argument, expecting a later reduction to constant_function_call and finally constant_expression?

I think both would be correct at this point, and we can't know for sure until we see CLOSEBRACKETS or COLON or PLUS COLON or MINUS COLON.

Should I add the bison's glr option which clones the parser and makes it vanish when it comes to conflicts? Is there a different way to write the grammar?

rici · Accepted Answer

I don't really know much about Verilog and there isn't even a documentation link in the OP; moreover, most of the referenced grammar productions are not in the OP either. So the best I can do is provide some general guidance, combined with a few snippets of knowledge I have about Verilog -- which is, as I understand, non-trivial to parse correctly.

As indicated, there are (at least) two issues in parsing variable_lvalue, although in the end they are surprisingly similar:

The Verilog grammar appears to require distinguishing syntactically between constant and non-constant expressions; and
There are (at least) three different meanings for [...] in a hierarchical_variable_name, and they have slightly different but overlapping syntaxes. (Here I'm being guided by a different SO question; this one only shows two of the three syntaxes.)

Constant and non-constant expressions

Technically, the grammar provides for constant_expressions and unqualified expressions, since each expression context will be produced either from expression or constant_expression (or some synonym for one of those two non-terminals). However, in a bottom-up parser, it would be easier if the two expression grammars were mutually exclusive. Ignoring operator precedence, the basic scheme would be:

expression             : constant_expression
                       | non_constant_expression;

non_constant_expression: non_constant_primary
                       | unary_operator non_constant_primary
                       | non_constant_expression binary_operator expression
                       | constant_expression binary_operator non_constant_expression

constant_expression    : constant_primary
                       | unary_operator constant_primary
                       | constant_expression binary_operator constant_expression

This will work fine, except for the problem with distinguishing between constant_primary and non_constant_primary. Clearly, literals (numbers and strings) are constant_primary, and (as it happens) hierarchical names with at least two components are non_constant_primary. But simple identifiers might be either.

For those cases (or languages) where identifiers must be declared before use, it would be possible to distinguish between constant and non-constant identifiers by looking the identifiers up in the symbol table.

For this to work, the symbol table needs to be shared with the lexer, and a certain amount of logic about name lookup needs to be available to the lexer, especially if the language allows scoped names. Many C parsers do exactly this, because there are expressions which cannot be parsed without knowing whether an identifier is a type or a variable. ((something)+1/2, for example: that is 0.5 if something is double, and 0.0 if something is a variable whose value is 0.0.)

But this strategy breaks down with constant functions, because constant functions do not need to be declared before use. (Mutually recursive constant functions are allowed.) Nonetheless, because constant function calls are evaluated in a different phase from non-constant functions calls, it is important to decide for each call site whether it is a constant call or not.

In addition, sharing the symbol table between the parser and the lexer violates separation of concerns. And it is not as easy as it sounds: the lexer cannot know whether the an instance of an identifier is a definitio or a direct use. It's non-trivial (for the lexer) to distinguish between the use of an identifier token as an unqualified name or as part of a hierarchical name. (I don't know whether hierarchical subcomponent names can be the same as parameter names, but it seems reasonable that they could be; certainly, in most languages a structure member name can be spelled identically to a global constant without creating any ambiguity.)

Added to the clear need to do semantic analysis to decide whether a given function identifier is eligible to be a constant_function_identifier or not, it seems to me that the only plausible strategy is to generate an AST, parsing expressions without regard to whether they are supposed to be constant or not, and then to perform a validation walk over the AST to verify that only constant expressions occur in those contexts which require expressions to be constant. (Probably there would be an intermediate AST to determine which functions could be used as constant function calls.)

That will simplify the grammar, and restricts the constant expression analysis to a single well-defined interface, rather than smearing it over the entire grammar.

The second issue results from the fact that brackets are used:

as selectors for generated names (in which case only literal integers are allowed)
as array indices (in which case an arbitrary expression is allowed)
as range definitions, which might look like an array index, but might also be two constant expressions (but not necessarily literals) separated by a colon.

A single selector may follow any component in a hierarchical name, while array indices and range expressions are postfix operators which follow a hierarchical name; if present, the range expression must come last.

As I understand it, selectors and array indices are either required or forbidden, depending on the declaration of the named object. But without knowing the declaration of the identifier, the three uses are not always distinguishable: [1] could be any one of the three possibilities.

In theory, looking the identifier up in the symbol table could resolve the issue. But there are many possible contexts in which an identifier could appear; expecting the lexer to determine when it is appropriate to insist on a selector or index is probably pushing the lexer to duplicate far too much of the work of the parser.

So once again, it seems like it would be much easier to do the analysis in a semantic rule. In this case, I believe, it is possible to do the analysis at parse time because pre-declaration is required, so an appropriate place to do the check would be in the reduction rule for variable_lvalue.

If you go down that road, there is no point overcomplicating the grammar. You can use a grammar which will accept some syntactically incorrect inputs, knowing that all such errors will be detected by the semantic rule. The simplest such grammar just accepts any sequence of .name and [...] suffixes: (See note 1)

hierarchical_lvalue: hierarchical_component
                   | hierarchical_lvalue '.' hierarchical_component
                   | hierarchical_lvalue '[' expression ']'
                   | hierarchical_lvalue '[' expression ':' expression ']'
                   | hierarchical_lvalue '[' expression "+:" expression ']'
                   | hierarchical_lvalue '[' expression "-:" expression ']'

Notes

In this answer, I've avoided the use of ungainly symbolic names for single-character tokens like :. Bison lets you write that as ':' (in the flex action, you return ':';) without even having to declare that as a %token; the result is IMO far more readable and somewhat easier to maintain. I've also used double-quoted token aliases for readability; instead of writing

%token PLUS_COLON %% .... : expression PLUS_COLON expression ...

you can declare the alias

%token PLUS_COLON "+:"
%%
.... : expression "+:" expression ...

In this case, the lexer still needs to use the symbolic name, but the grammar is easier to read. (So are error messages, if you use extended error messages: bison will be able to tell the user that ":", "+:" or "-:" was expected.)

How can Verilog's variable_lvalue be written for Bison?

Answers (1)

Constant and non-constant expressions

Notes

Related Questions

How can Verilog&#39;s variable_lvalue be written for Bison?

Answers (1)

Constant and non-constant expressions

Notes

Related Questions

How can Verilog's variable_lvalue be written for Bison?