Sgnes

Reputation: 11

Antlr4 how to discard all unmatched input

I want to extract all preprocessor statements from a C source file and ignore all other statements. I've tried adding a last rule like Unknown : . -> skip ; // or -> channel(HIDDEN) ; in the lexer, or adding a last rule like ignored : . ; in the parser, but it does not work.

Here is my grammar :

grammar PreProcessStatement;


pre_if_statement
: pre_if pre_elif* pre_else? pre_endif
;

pre_if      :   PreProcessBegin 'if'    statement;
pre_endif   :   PreProcessBegin 'endif' ;
pre_else    :   PreProcessBegin 'else'  ;
pre_elif    :   PreProcessBegin 'elif'statement ;
pre_define  :   PreProcessBegin 'define' statement;
pre_undef   :   PreProcessBegin 'undef'statement    ;
pre_pragma  :   PreProcessBegin 'pragma'statement;

statement
: IDENTIFIER
| statement Condition statement
| '(' statement (Condition | Logic_or | Logic_and) statement ')'
| statement (Logic_or | Logic_and) statement
;



Logic_or
: '||'
;

Logic_and
: '&&'
;
PreProcessBegin :   '#'     ;
Condition       : '==' | '>' | '>='|  '<' | '<='    ;
NUM             : INT | HEX     ;
STRID           : '"'ID'"'  ;
IDENTIFIER      : [a-zA-Z_0-9]+ ;
ID              :   [a-zA-Z_]+ ;
INT             :   [0-9]+ ;
HEX             : '0x'INT;
WS              :   [ \t\n\r]+ -> skip ;
NewLine         : ('\n' | '\r' | '\n\r');
MulLine     : '\\' NewLine -> skip ;
Unknown : .*? -> skip ; // or -> channel(HIDDEN) ;

Input:

#if (test == ttt)
#elif rrrr
#else
aaa
#endif

Error:

line 4:0 extraneous input 'aaa' expecting '#'

I've read the link below, but it does not work for me: Skipping unmatched input in Antlr

What's wrong with my grammar?

Upvotes: 1

Views: 892

Answers (1)

quepas

Reputation: 1003

Explanation

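The aaa input won't match the Unknown token. It matches the IDENTIFIER : [a-zA-Z_0-9]+ token instead, which is defined before the Unknown lexeme. The ANTLR lexer prefers the rule that matches the most characters and, when two rules match the same number of characters, the one defined first, so a catch-all placed last never gets to claim aaa.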
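
To make that selection rule concrete, here is a minimal Python sketch. It is not ANTLR, only a simulation of "longest match wins, earlier rule wins ties". The RULES table and the next_token helper are hypothetical names made up for this illustration, and Unknown is assumed to be a single-character catch-all.

```python
import re

# Hypothetical stand-ins for three of the question's lexer rules, in file order.
RULES = [
    ("IDENTIFIER", re.compile(r"[a-zA-Z_0-9]+")),
    ("ID", re.compile(r"[a-zA-Z_]+")),
    ("Unknown", re.compile(r".")),  # assumed: a single-character catch-all
]

def next_token(text, pos):
    """Pick the rule with the longest match; the earliest rule wins ties."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strictly longer only, so an earlier rule keeps its place on a tie.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(next_token("aaa", 0))   # ('IDENTIFIER', 'aaa')
print(next_token("@aaa", 0))  # ('Unknown', '@')
```

The catch-all only wins where nothing else matches, which is why aaa still reaches the parser as an IDENTIFIER.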

Solutions

Modify token

Put the Unknown lexeme definition before the other tokens. Add a semantic predicate to this lexeme that checks whether the first character of the line is not a # character. If that is true, skip the whole line up to and including the NewLine token.

Unknown : {getCharPositionInLine() == 0 && _input.LA(1) != '#'}? .*? NewLine -> skip;
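
As a rough picture of what this predicate achieves, here is a small Python sketch that applies the same test line by line. It is only an illustration of the intended effect, not generated lexer code, and keep_preprocessor_lines is a hypothetical helper name.

```python
def keep_preprocessor_lines(source):
    """Drop every line whose first character (column 0) is not '#'."""
    kept = []
    for line in source.splitlines():
        # Same check as the predicate: look only at the character in column 0.
        if line[:1] == "#":
            kept.append(line)
    return kept

SAMPLE = "#if (test == ttt)\n#elif rrrr\n#else\naaa\n#endif\n"
print(keep_preprocessor_lines(SAMPLE))
# ['#if (test == ttt)', '#elif rrrr', '#else', '#endif']
```

Note that, like the predicate, this treats a directive with leading whitespace (for example "  #define X") as an ordinary line and drops it.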

Use lexer modes

When you spot a # character, enter a new lexer mode, PREPROCESSOR. From then on, only the tokens defined within PREPROCESSOR mode are used, and the mode is exited when a new line occurs. Outside the mode, the lexer looks for just two tokens: PreProcessBegin (a line starting with a # character) and Unknown (a line without a #). Inside PREPROCESSOR mode, the statements are matched as in any other regular language.

Example of the lexer:

PreProcessBegin : '#' -> pushMode(PREPROCESSOR); // enter mode
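Unknown : (~[#\r\n] ~[\r\n]* | [\r\n]+) -> skip; // or skip a line not starting with '#' (.*? NewLine would also swallow '#' lines, being the longer match)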

mode PREPROCESSOR; // in PREPROCESSOR mode, use the tokens defined below
(...)
Condition : '==' | '>' | '>='|  '<' | '<=';
IDENTIFIER : [a-zA-Z_0-9]+ ;
ID : [a-zA-Z_]+ ;
INT : [0-9]+ ;
(...)
NewLine : ('\r\n' | '\n' | '\r') -> popMode; // exit mode
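
The mode switch can be pictured with a short Python sketch. Again, this is only a simulation of the idea and not ANTLR output: the tokenize function, the PREPROCESSOR_TOKENS pattern, and the Logic and Paren token names are assumptions introduced for the illustration.

```python
import re

# Patterns that are active only inside the simulated PREPROCESSOR mode.
PREPROCESSOR_TOKENS = re.compile(r"""
    (?P<Condition>==|>=|<=|>|<)
  | (?P<Logic>\|\||&&)
  | (?P<Paren>[()])
  | (?P<IDENTIFIER>[a-zA-Z_0-9]+)
  | (?P<WS>[ \t]+)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    for line in source.splitlines():
        if not line.startswith("#"):
            continue  # default mode: the whole line is Unknown and is skipped
        tokens.append(("PreProcessBegin", "#"))  # like pushMode(PREPROCESSOR)
        pos = 1
        while pos < len(line):
            m = PREPROCESSOR_TOKENS.match(line, pos)
            if m is None:
                raise ValueError(f"unexpected character {line[pos]!r}")
            if m.lastgroup != "WS":  # whitespace is skipped inside the mode
                tokens.append((m.lastgroup, m.group()))
            pos = m.end()
        # Reaching the end of the line plays the role of NewLine -> popMode.
    return tokens

SAMPLE = "#if (test == ttt)\n#elif rrrr\n#else\naaa\n#endif\n"
for token in tokenize(SAMPLE):
    print(token)
```

With the input from the question, the aaa line never produces a token, so the parser only sees the preprocessor lines.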

Upvotes: 1
