Reputation: 76
In my ANTLR grammar I have a set of operations (OP):
whenever I find TOKEN:
the remaining operations are used by other instructions.
Obviously, in my lexer I can't define two tokens like this:
OP_MOVE1 : 'I1' | 'I2' | 'I3';
OP_MOVE2 : 'I2' | 'I4' | 'I5';
because I would get:
OP_MOVE2 values unreachable. I2 is always overlapped by token OP_MOVE1
On top of that, imagine that the operations range not just from I1 to I10 but from I1 to I5000.
One possible solution might be:
LEXER.g4:
lexer grammar LexerComment;
CASE : 'CASE' -> pushMode(CASE_MODE);
SWITCH : 'SWITCH' -> pushMode(CASE_SWITCH);
WS : [ \t] -> skip ;
EOL : [\r\n]+;
// ------------ Everything INSIDE a CASE ------------
mode CASE_MODE;
CASE_MODE_MOVE1 : 'I1' | 'I2' | 'I3';
CASE_MODE_WS : [ \t] -> channel(HIDDEN) ;
CASE_MODE_EOL : EOL -> type(EOL),popMode;
// ------------ Everything INSIDE a SWITCH ------------
mode CASE_SWITCH;
CASE_SWITCH_MOVE2: 'I2' | 'I4' | 'I5';
CASE_SWITCH_WS : [ \t] -> channel(HIDDEN) ;
CASE_SWITCH_EOL : EOL -> type(EOL),popMode;
PARSER.g4:
parser grammar ParserComment;
options {
tokenVocab = LexerComment;
}
prog : (line? EOL)+;
line : instruction;
instruction: CASE CASE_MODE_MOVE1
|SWITCH CASE_SWITCH_MOVE2;
inputFile:
CASE I1
CASE I2
CASE I3
SWITCH I2
SWITCH I4
SWITCH I5
The grammar seems to work correctly, but I'm not satisfied with the solution: it requires a lot of code, one mode for each context, and the tokens the modes have in common must be repeated in every mode.
It also gets worse if I want to recognize, in addition to CASE and SWITCH, a line that begins with MOVE1 or MOVE2, for example:
instruction: CASE CASE_MODE_MOVE1
|SWITCH CASE_SWITCH_MOVE2
| MOVE1 ;
I have not found a good solution to this problem.
Is there a way to handle such cases correctly? Given a set of tokens, I would like to use only a subset of them depending on the context,
if possible without having to define every single operation as a basic token:
fragment I1: 'I1';
fragment I2: 'I2';
etc
EDIT: In my grammar the same token can have different meanings.
For example, with the following grammar the token I1 has two different meanings.
Parser:
parser grammar ParserComment;
options {
tokenVocab = LexerComment;
}
prog : (line? EOL)+;
line : instruction;
instruction: CONTEXT case_instruction;
case_instruction
: I1
| I2
| I3
;
Lexer:
lexer grammar LexerComment;
// I1 IS A CONTEXT
CONTEXT: I1 | CASE;
CASE : 'CASE';
SWITCH : 'SWITCH';
//OPERATIONS
I1 : 'I1';
I2 : 'I2';
I3 : 'I3';
I4 : 'I4';
I5 : 'I5';
WS : [ \t] -> skip ;
EOL : [\r\n]+;
Despite having the same name, I1 has two completely different meanings (CONTEXT and OP). I would like to recognize these two cases and avoid having a common I1:
prog : (line? EOL)+;
line : instruction;
instruction: context case_instruction;
context: I1 | CASE;
case_instruction
: I1
| I2
| I3
;
This is why I tried to handle it with modes in the lexer.
Upvotes: 0
Views: 187
Reputation: 170227
IMO, you should not let the lexer decide when certain tokens should be created. Let the parser decide when a token is correct in a certain spot.
Something like this:
prog
: instruction* EOF
;
instruction
: CASE case_instruction EOL
| SWITCH switch_instruction EOL
| EOL
;
case_instruction
: I1
| I2
| I3
;
switch_instruction
: I2
| I4
| I5
;
CASE : 'CASE';
SWITCH : 'SWITCH';
I1 : 'I1';
I2 : 'I2';
I3 : 'I3';
I4 : 'I4';
I5 : 'I5';
EOL : '\r'? '\n' | '\r';
SPACES : [ \t]+ -> skip;
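If writing out I1 to I5000 as separate tokens is not practical, one possible variation (a sketch only, not tested against your real grammar) is to lex every operation as a single generic token and let the parser check the allowed subset with a semantic predicate. The grammar name `Ops`, the `OP` token and the `CASE_OPS` / `SWITCH_OPS` sets are made up for illustration. The embedded code assumes the Java target (Java 9+ for `Set.of`), and the subsets are the ones from your example:

```antlr
grammar Ops; // hypothetical grammar name

@parser::members {
    // Assumed subsets, copied from the example in the question.
    private static final java.util.Set<String> CASE_OPS =
        java.util.Set.of("I1", "I2", "I3");
    private static final java.util.Set<String> SWITCH_OPS =
        java.util.Set.of("I2", "I4", "I5");
}

prog
 : instruction* EOF
 ;

instruction
 : CASE case_instruction EOL
 | SWITCH switch_instruction EOL
 | EOL
 ;

// The predicate inspects the text of the next token before OP is matched.
case_instruction
 : {CASE_OPS.contains(_input.LT(1).getText())}? OP
 ;

switch_instruction
 : {SWITCH_OPS.contains(_input.LT(1).getText())}? OP
 ;

CASE   : 'CASE';
SWITCH : 'SWITCH';
OP     : 'I' [0-9]+; // one token type for every operation
EOL    : '\r'? '\n' | '\r';
SPACES : [ \t]+ -> skip;
```

Note that `OP` as written also matches names such as I0 or I007, and that predicates tie the grammar to one target language. Doing the same subset check in a listener or visitor after parsing would keep the grammar itself target-independent.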
Upvotes: 0