Reputation: 8695

Tokenizing Ellipsis in a Programming Language to Avoid Floating Points

I am designing a language where I want to use .. to define an integer range. The problem is that 0..10 is tokenized as the floats 0. and .10.

How do I support this syntax with flex? Is it as simple as making 0. an invalid float?

Upvotes: 1

Answers (3)

brian beuning

Reputation: 2862

Using Kaz's answer, I solved ellipsis lexing like below. I had to split one rule into two to get it to work.

/* numbers like 5.1 and .5 */

[0-9]+\.[0-9]+|\.[0-9]+ { }

/* numbers like 5. but not 5.. */
/* The trailing context helps us tokenize 1..10 as INT, DOTDOT, INT */
/* instead of FLOAT, DOT, INT */

[0-9]+\./[^.] { }

Upvotes: 0

Kaz

Reputation: 58617

March 9, 2012:

If you don't want to make 0. an invalid float, you can use a Lex trailing context:

[0-9]+[.]/[^.]  { /* recognize <digits>. only if not followed by another . */ }

For input such as 0..10, this rule is rejected because the trailing context does not match, so 0 is recognized only as an integer constant. The next input is then .., which can be recognized as a dot-dot token.
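
To see the effect of the trailing context, here is a toy Python model of a lex-style scanner. It is an illustration only, not flex itself: the scan helper, the two rule tables and the token names are made up for this sketch. It assumes that the longest match wins, that earlier rules win ties, and that a trailing context counts toward the match length but is not consumed.

```python
import re

# Rule tables: (token name, pattern, trailing context or None).
# Hypothetical names; the patterns loosely follow the rules discussed above.
NAIVE = [
    ("FLOAT",  r"[0-9]*\.[0-9]+|[0-9]+\.", None),   # accepts 0. and .10
    ("INT",    r"[0-9]+",                  None),
    ("DOTDOT", r"\.\.",                    None),
]
WITH_CONTEXT = [
    ("FLOAT",  r"[0-9]*\.[0-9]+", None),            # 5.1 and .5
    ("FLOAT",  r"[0-9]+\.",       r"[^.]"),         # 5. only if no second dot follows
    ("INT",    r"[0-9]+",         None),
    ("DOTDOT", r"\.\.",           None),
]

def scan(rules, text):
    """Tokenize text with the given rules, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_span, best = 0, None
        for name, pattern, context in rules:
            m = re.compile(pattern).match(text, pos)
            if m is None:
                continue
            span = m.end() - pos
            if context is not None:
                ctx = re.compile(context).match(text, m.end())
                if ctx is None:
                    continue          # trailing context mismatch: rule rejected
                span += ctx.end() - ctx.start()
            if span > best_span:      # strict '>' keeps the earlier rule on ties
                best_span, best = span, (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(best)
        pos += len(best[1])           # consume the lexeme, not the context
    return tokens

print(scan(NAIVE, "0..10"))         # [('FLOAT', '0.'), ('FLOAT', '.10')]
print(scan(WITH_CONTEXT, "0..10"))  # [('INT', '0'), ('DOTDOT', '..'), ('INT', '10')]
print(scan(WITH_CONTEXT, "0. 10"))  # [('FLOAT', '0.'), ('INT', '10')]
```

With the trailing context in place, 0. is still accepted as a float when something other than a dot follows it, while 0..10 splits into an integer, a dot-dot and an integer.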

March 20, 2012:

I'm going to have to eat my own advice now!

Regression from adding floating point support:

$ ./txr -v -l -c '@(bind a (1..3))'
spec:
(((bind a (1.0 0.3))))  # oops, should be (cons 1 3)
bindings:
nil
(a 1.0 0.3)

:) Same problem. DOTDOT token in the language, and floats that can begin with a point and end with a point.

My proposed solution is not going to be that straightforward to apply because a floating point token is matched with a single complex regex, where the match for the decimal point is in the middle somewhere, prior to some optional digits and exponent section. The lex trailing context can only be at the end.

It will be necessary to change the expression so that it does not recognize 123. as a valid token, and recognize 123. with an additional rule which has the trailing context.

Working fix from actual code:

diff --git a/parser.l b/parser.l
index d8fd915..449cc14 100644
--- a/parser.l
+++ b/parser.l
@@ -149,8 +149,12 @@ static wchar_t num_esc(char *num)
 %option noinput

 SYM     [a-zA-Z0-9_]+
-NUM     [+\-]?[0-9]+
-FLO     [+\-]?([0-9]+[.]?[0-9]*|[0-9]*[.][0-9]+)([eE][+-]?[0-9]+)?
+SGN     [+\-]
+EXP     [eE][+\-]?[0-9]+
+DIG     [0-9]
+NUM     {SGN}?{DIG}+
+FLO     {SGN}?{DIG}*[.]({DIG}+{EXP}?|{EXP})
+FLODOT  {SGN}?{DIG}+[.]
 BSCHR   [a-zA-Z0-9!$%&*+\-<=>?\\^_~]
 BSYM    {BSCHR}({BSCHR}|#)*
 NSCHR   [a-zA-Z0-9!$%&*+\-<=>?\\^_~/]
@@ -190,7 +194,8 @@ UONLY   {U2}{U}|{U3}{U}{U}|{U4}{U}{U}{U}
   return NUMBER;
 }

-<SPECIAL,NESTED,BRACED>{FLO} {
+<SPECIAL,NESTED,BRACED>{FLODOT}/[^.] |
+<SPECIAL,NESTED,BRACED>{FLO}         {
   val str = string_own(utf8_dup_from(yytext));

   if (yy_top_state() == INITIAL

So floating constants are of two forms now:

1. An optional sign, followed by zero or more digits and a decimal point, followed by either an exponent, or digits and optionally an exponent.
2. An optional sign and one or more digits followed by a dot, with a trailing context asserting that the next character is not a dot.
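
As a quick sanity check of the two forms, the FLO and FLODOT definitions from the diff can be transcribed into Python regular expressions. This is only a sketch: the classify helper is hypothetical, it tests whole strings rather than scanning input, and it does not model the trailing context at all.

```python
import re

# Direct transcriptions of the lex definitions in the diff above.
SGN = r"[+\-]"
EXP = r"[eE][+\-]?[0-9]+"
DIG = r"[0-9]"
FLO = rf"{SGN}?{DIG}*[.](?:{DIG}+(?:{EXP})?|{EXP})"
FLODOT = rf"{SGN}?{DIG}+[.]"

def classify(text):
    """Return which float form matches the whole string, or None."""
    if re.fullmatch(FLO, text):
        return "FLO"
    if re.fullmatch(FLODOT, text):
        return "FLODOT"
    return None

for literal in ["1.5", ".5", "-2.e10", "1.5e-3", "123.", "123"]:
    print(literal, classify(literal))
```

Only 123. falls through to FLODOT, which is the one form that needs the trailing context; 123 matches neither and is left to the NUM rule.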

Note how the above demonstrates the technique of multiple patterns (each with its own flex start states, too) which share the same action:

PATTERN1 |
PATTERN2 |
...
PATTERNN { action; }

The patterns can individually have trailing contexts, or not.

P.S. If I change my mind and disallow 123., I just remove all traces of FLODOT.

Bug:

This code has to be reversed:

<SPECIAL,NESTED,BRACED>{FLODOT}/[^.] |
<SPECIAL,NESTED,BRACED>{FLO}         {

I.e.

<SPECIAL,NESTED,BRACED>{FLO}         |
<SPECIAL,NESTED,BRACED>{FLODOT}/[^.] {

It looks like flex counts the matched trailing context as part of the length of the match. That is, 3.0 is considered a three-character match in the FLODOT case, even though the extracted token is only 3. (with the trailing dot). Both patterns then match three characters and the earlier one wins, so the FLODOT case must come second for 3.0 to come out via the FLO match.
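
The ordering problem can be reproduced with a small Python model of the selection rule. This is a sketch under stated assumptions, not flex itself: make_rules and scan are hypothetical helpers, the token names are invented, and the model assumes the longest match wins, the trailing context counts toward that length, and the earlier rule wins a tie.

```python
import re

# Transcriptions of the definitions from the diff above.
SGN, DIG, EXP = r"[+\-]", r"[0-9]", r"[eE][+\-]?[0-9]+"
NUM = rf"{SGN}?{DIG}+"
FLO = rf"{SGN}?{DIG}*[.](?:{DIG}+(?:{EXP})?|{EXP})"
FLODOT = rf"{SGN}?{DIG}+[.]"

def make_rules(flodot_first):
    """Build (token, pattern, trailing context) rules in either order."""
    flo = ("FLOAT", FLO, None)
    flodot = ("FLOAT", FLODOT, r"[^.]")
    pair = [flodot, flo] if flodot_first else [flo, flodot]
    return [("NUMBER", NUM, None)] + pair + [("DOTDOT", r"\.\.", None)]

def scan(rules, text):
    """Tokenize text; whitespace is skipped."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_span, best = 0, None
        for name, pattern, context in rules:
            m = re.compile(pattern).match(text, pos)
            if m is None:
                continue
            span = m.end() - pos
            if context is not None:
                ctx = re.compile(context).match(text, m.end())
                if ctx is None:
                    continue              # trailing context mismatch
                span += ctx.end() - ctx.start()
            if span > best_span:          # ties go to the earlier rule
                best_span, best = span, (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append(best)
        pos += len(best[1])               # the context is not consumed
    return tokens

print(scan(make_rules(flodot_first=True), "3.0"))    # [('FLOAT', '3.'), ('NUMBER', '0')]
print(scan(make_rules(flodot_first=False), "3.0"))   # [('FLOAT', '3.0')]
print(scan(make_rules(flodot_first=False), "1..3"))  # [('NUMBER', '1'), ('DOTDOT', '..'), ('NUMBER', '3')]
```

In this model, putting FLODOT first splits 3.0 into 3. and 0, while putting FLO first keeps 3.0 whole and still leaves 1..3 as two integers around a dot-dot.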

Upvotes: 3

Eduardo

Reputation: 8402

This is really up to how you define your lexer. If you define the ellipsis as being a token made up of two periods, and also define your floating point numbers correctly, there should not be a conflict. Just make sure the token for the ellipsis is defined first in your specification. And, yes, 0. should be an invalid float.