Juan Torres
Juan Torres

Reputation: 744

Flex: matching signed int vs addition/subtraction

I have a flex file that returns me tokens depending on an input file. I currently have it set up to return a token INT when I see a signed int in flex, and a token ADD or SUB when I see an addition or subtraction.

I have a macro defined for a signed int in flex

signedint = [+-]?[0-9]+

When flex sees this pattern, it returns an INT token.

However, I have run into the problem of being able to distinguish a signed integer from an addition or subtraction. For instance,

5 + 2

returns 3 tokens: INT, ADD, INT. This is fine.

When the spaces are missing, however, I'm running into trouble:

5+2
5 +2

Either of these return just two tokens (2 ints), since flex sees 5 as a signedint (correctly) and returns it, and then sees '+2' and sees another signed int (it doesn't return anything for whitespace).

Any ideas on how to get it to distinguish between a signed int and an addition/subtraction?

Upvotes: 2

Views: 528

Answers (2)

rici
rici

Reputation: 241771

The difficulty of solving this problem at the lexical level is the reason that many languages don't even try. The parser can easily distinguish between a unary prefix + or - operator and a binary infix operator with the same spelling, and aside from one corner case (see below), there is no real semantic difference between -2 considered as a prefix minus followed by an integer constant, and -2 considered as a single token. In the parser, you can use constant folding to evaluate constant sub-expressions if you don't want the operators to appear in the AST.

It is only possible to distinguish between infix and prefix operators during the lexical scan by maintaining lexical state, which effectively duplicates part of the parsing algorithm into a hand-built state machine in the lexical scanner. In the case of ordinary arithmetic expressions, it is only a small part of the parsing algorithm, but even so it is not pretty and complicates the verification of correctness of the lexer/parser combination.

Leaving out operator precedence and associativity, which are not relevant here, the grammar for an arithmetic expression can be reduced to something like the following:

expr: term
    | expr INFIX term
term: CONSTANT | VARIABLE
    | '(' expr ')'
    | PREFIX term

(That omits postfix operators, including function calls and array subscripts, but the principle is not affected by that.)

From that grammar, it is easy to derive FIRST, LAST and FOLLOW sets. term can only start with a terminal, and expr can only start with a term, so they both have the same FIRST set:

FIRST(expr) = FIRST(term) = { (, PREFIX, CONSTANT, VARIABLE }

By similar reasoning, the LAST sets of term and expr are also identical:

LAST(expr) = LAST(term) = { ), CONSTANT, VARIABLE }

And finally, the FOLLOW sets for the non-terminals, based on the observation that a term only appears at the end of an expr, and the expr is either at the end of the input, or appears in the grammar followed by a terminal:

FOLLOW(term) = FOLLOW(expr) = { ), INFIX, $ }

(where $ is the end-of-input mark.)

All that lets us compute FOLLOW sets for terminals, using the observation that for each terminal A in LAST of non-terminal V can only be followed by a terminal in FOLLOW(V). (In some grammars that might overestimate the FOLLOW sets, but it doesn't matter in this case.) Which finally gives us:

 terminal           can be followed by
 --------           ------------------
 INFIX              PREFIX, (, CONSTANT, VARIABLE
 (                  PREFIX, (, CONSTANT, VARIABLE
 PREFIX             PREFIX, (, CONSTANT, VARIABLE
 )                  INFIX, ), $
 CONSTANT           INFIX, ), $
 VARIABLE           INFIX, ), $

In short, PREFIX and INFIX never occur in the same context. If the previous token was INFIX, PREFIX or ( (or there was no previous token), then an operator must be PREFIX. Otherwise, the operator must be INFIX. And we can implement that in flex using two start conditions: one for the case where we might see a CONSTANT, and the other for the case where we cannot legally see a CONSTANT. The first one is the INITIAL state.

That translates into the following little flex description:

%x FINAL
%%

<INITIAL>[-+]?[[:digit:]]+  {
                              BEGIN(FINAL); return CONSTANT;
                            }
<INITIAL>[[:alpha:]][[:alnum:]]* {
                              BEGIN(FINAL); return VARIABLE;
                            }
<INITIAL>[(]                  return yytext[0];
<INITIAL>[-]                  return PREFIX_MINUS;
<INITIAL>[+]                  return PREFIX_PLUS;

<FINAL>[)]                    return yytext[0];
<FINAL>[-+*/] BEGIN(INITIAL); return yytext[0];

<*>.                          /* Signal invalid token */

That's not complete, of course. It leaves out the setting of yylval and the handling of input which is not part of the expression (including newline and other whitespace).

Although this solution works, it's easy to see why leaving the issue to the parser is preferred:

%%
[-+*/()]                 return yytext[0];
[[:digit:]]+             return CONSTANT;
[[:alpha:]][[:alnum:]]*  return VARIABLE;

But there is one corner case which needs to be handled carefully. In an N-bit 2's complement representation, it is possible to represent -2N but it is not possible to represent +2N, since the maximum positive number is 2N−1. If signed integers are deferred to the parser as expressions, it is vital that the integer 2N be permitted, even though it will not fit into the signed integer type being used.

That can be accomplished by using an unsigned integer type to pass integer values to the parser, which in turn means that the parser will need to detect the overflow condition.

As it happens, that's not how C handles this case, which leads to an interesting anomaly. In C (as above), an integer constant does not include a sign; -2 is two tokens. But integer constants do include an (implicit) type, which in the case of a decimal integer is the smallest signed integer type which can hold the value of the constant. Since unary negation preserves type, the result is that on a 32-bit machine, -2147483648 is of type long (or long long) even though it is representable as an int. This occasionally causes confusion.

Upvotes: 4

SKalariya
SKalariya

Reputation: 48

signedint = \s*[+-]?\s*[0-9]+\s*

This should work

Upvotes: -2

Related Questions