Reputation: 678
I'm working on a lexer and parser for an old object oriented chat system (MOO in case any readers are familiar with its language). Within this language, any of the below examples are valid floating point numbers:
2.3
3.
.2
3e+5
The language also implements an indexing syntax for extracting one or more characters from a string or list (which is a set of comma separated expressions enclosed in curly braces). The problem arises from the fact that the language supports a range operator inside the index brackets. For example: a = foo[1..3];
I understand that ANTLR wants to match the longest possible match first. Unfortunately this results in the lexer seeing '1..3' as two floating points numbers (1. and .3), rather than two integers with a range operator ('..') between them. Is there any way to solve this short of using lexer modes? Given that the values inside of an indexing expression can be any valid expression, I would have to duplicate a lot of token rules (essentially all but the floating point numbers as I understand it). Now granted I'm new to ANTLR so I'm sure I'm missing something and any help is much appreciated. I will supply my lexer grammar below:
lexer grammar MooLexer;
channels { COMMENTS_CHANNEL }
SINGLE_LINE_COMMENT
: '//' INPUT_CHARACTER* -> channel(COMMENTS_CHANNEL);
DELIMITED_COMMENT
: '/*' .*? '*/' -> channel(COMMENTS_CHANNEL);
WS
: [ \t\r\n] -> channel(HIDDEN)
;
IF
: I F
;
ELSE
: E L S E
;
ELSEIF
: E L S E I F
;
ENDIF
: E N D I F
;
FOR
: F O R;
ENDFOR
: E N D F O R;
WHILE
: W H I L E
;
ENDWHILE
: E N D W H I L E
;
FORK
: F O R K
;
ENDFORK
: E N D F O R K
;
RETURN
: R E T U R N
;
BREAK
: B R E A K
;
CONTINUE
: C O N T I N U E
;
TRY
: T R Y
;
EXCEPT
: E X C E P T
;
ENDTRY
: E N D T R Y
;
IN
: I N
;
SPLICER
: '@';
UNDERSCORE
: '_';
DOLLAR
: '$';
SEMI
: ';';
COLON
: ':';
DOT
: '.';
COMMA
: ',';
BANG
: '!';
OPEN_QUOTE
: '`';
SINGLE_QUOTE
: '\'';
LEFT_BRACKET
: '[';
RIGHT_BRACKET
: ']';
LEFT_CURLY_BRACE
: '{';
RIGHT_CURLY_BRACE
: '}';
LEFT_PARENTHESIS
: '(';
RIGHT_PARENTHESIS
: ')';
PLUS
: '+';
MINUS
: '-';
STAR
: '*';
DIV
: '/';
PERCENT
: '%';
PIPE
: '|';
CARET
: '^';
ASSIGNMENT
: '=';
QMARK
: '?';
OP_AND
: '&&';
OP_OR
: '||';
OP_EQUALS
: '==';
OP_NOT_EQUAL
: '!=';
OP_LESS_THAN
: '<';
OP_GREATER_THAN
: '>';
OP_LESS_THAN_OR_EQUAL_TO
: '<=';
OP_GREATER_THAN_OR_EQUAL_TO
: '>=';
RANGE
: '..';
ERROR
: 'E_NONE'
| 'E_TYPE'
| 'E_DIV'
| 'E_PERM'
| 'E_PROPNF'
| 'E_VERBNF'
| 'E_VARNF'
| 'E_INVIND'
| 'E_RECMOVE'
| 'E_MAXREC'
| 'E_RANGE'
| 'E_ARGS'
| 'E_NACC'
| 'E_INVARG'
| 'E_QUOTA'
| 'E_FLOAT'
;
OBJECT
: '#' DIGIT+
| '#-' DIGIT+
;
STRING
: '"' ( ESC | [ !] | [#-[] | [\]-~] | [\t] )* '"';
INTEGER
: DIGIT+;
FLOAT
: DIGIT+ [.] (DIGIT*)? (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| [.] DIGIT+ (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| DIGIT+ EXPONENTNOTATION EXPONENTSIGN DIGIT+
;
IDENTIFIER
: (LETTER | DIGIT | UNDERSCORE)+
;
LETTER
: LOWERCASE
| UPPERCASE
;
/*
* fragments
*/
fragment LOWERCASE
: [a-z] ;
fragment UPPERCASE
: [A-Z] ;
fragment EXPONENTNOTATION
: ('E' | 'e');
fragment EXPONENTSIGN
: ('-' | '+');
fragment DIGIT
: [0-9] ;
fragment ESC
: '\\"' | '\\\\' ;
fragment INPUT_CHARACTER
: ~[\r\n\u0085\u2028\u2029];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
Upvotes: 2
Views: 720
Reputation: 170158
No, AFAIK, there is no way to solve this using lexer modes. You'll need a predicate with a bit of target specific code. If Java is your target, that might look like this:
lexer grammar RangeTestLexer;
FLOAT
: [0-9]+ '.' [0-9]+
| [0-9]+ '.' {_input.LA(1) != '.'}?
| '.' [0-9]+
;
INTEGER
: [0-9]+
;
RANGE
: '..'
;
SPACES
: [ \t\r\n] -> skip
;
If you run the following Java code:
Lexer lexer = new RangeTestLexer(CharStreams.fromString("1 .2 3. 4.5 6..7 8 .. 9"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s `%s`\n", RangeTestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get the following output:
INTEGER `1`
FLOAT `.2`
FLOAT `3.`
FLOAT `4.5`
INTEGER `6`
RANGE `..`
INTEGER `7`
INTEGER `8`
RANGE `..`
INTEGER `9`
EOF `<EOF>`
The { ... }?
is the predicate and the embedded code must evaluate to a boolean. In my example, the Java code _input.LA(1) != '.'
returns true if the character stream 1 step ahead of the current position does not equal a '.'
char.
Upvotes: 3