Reputation: 179
I am trying to create a simple preprocessor in ANTLR. My grammar looks like this:
grammar simple_preprocessor;
ifdef_statement : POUND_IFDEF IDENTIFIER ;
else_statement : POUND_ELSE ;
endif_statement : POUND_ENDIF ;
preprocessor_statement :
ifdef_statement
code_block
else_statement
code_block
endif_statement
;
code_file : (preprocessor_statement | code_block)+ EOF ;
code_block : TEXT ;
POUND_IFDEF : '#IFDEF';
POUND_ELSE : '#ELSE';
POUND_ENDIF : '#ENDIF';
IDENTIFIER : ID_START ID_CONTINUE* ;
TEXT : ~[\u000C]+ ;
fragment ID_START : '_' | [A-Z] | [a-z] ;
fragment ID_CONTINUE : ID_START | [0-9] ;
WS : [ \t\r\n\u000C]+ -> channel(HIDDEN) ;
Then I parse the following using the code_file() rule:
#IFDEF one
print "1"
#ELSE
print "2"
#ENDIF
The string tree looks like this:
(code_file (code_block \n#IFDEF one\n print "1"\n#ELSE\n print "2"\n#ENDIF\n) <EOF>)
Not what I want, because the preprocessor tokens are being treated as text and match the code_block rule.
I read the "Islands in the Stream" chapter in the ANTLR book, and the XML example makes sense, but it relies on TEXT not containing two specific characters:
TEXT : ~[<&]+ ;
If I really have to, I suppose I could exclude the # character:
TEXT : ~[#]+ ;
But I'm hoping there's a better way to tell ANTLR to exclude my preprocessor tokens so it can distinguish them from generic code.
Thanks for any help.
Upvotes: 3
Views: 1814
Reputation: 5991
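Use a lexical mode to separate the preprocessor directives from the ordinary text definition of your base grammar. Use the \n# sequence and the next \n as your mode guards. Note that ANTLR 4 only allows modes in a lexer grammar, so the combined grammar has to be split into separate lexer and parser grammars.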
PStart : '\n#' -> channel(HIDDEN), pushMode(PreProc) ;
mode PreProc ;
PIFDEF : 'IFDEF' PTEXT* ;
PELSE : 'ELSE' ;
PENDIF : 'ENDIF' ;
PTEXT : [a-zA-Z0-9_-]+ ;
PEOL : [\r\n]+ -> channel(HIDDEN), popMode ;
PWS : [ \t]+ -> channel(HIDDEN) ;
// maybe PCOMMENT ?
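To see what the mode guards do, here is a minimal hand-written sketch in Python that mimics the PreProc mode above with a mode stack. It is not ANTLR and not generated code. The `tokenize` function, the `MODES` table, and the default-mode `TEXT` and `NL` rules are my own assumptions: `TEXT` is assumed to stop at end of line so that `\n#` is not swallowed by it. Tokens on the hidden channel are simply dropped.

```python
import re

# Rules per mode as (name, regex, action). As in ANTLR, the longest match wins
# and ties go to the rule listed first. Action is None, "push" or "pop".
MODES = {
    "DEFAULT": [
        ("PStart", re.compile(r"\n#"), "push"),      # guard: enter PreProc
        ("TEXT", re.compile(r"[^\r\n]+"), None),     # assumption: line-based TEXT
        ("NL", re.compile(r"[\r\n]"), None),         # assumption: hidden newline
    ],
    "PreProc": [
        ("PIFDEF", re.compile(r"IFDEF[a-zA-Z0-9_-]*"), None),
        ("PELSE", re.compile(r"ELSE"), None),
        ("PENDIF", re.compile(r"ENDIF"), None),
        ("PTEXT", re.compile(r"[a-zA-Z0-9_-]+"), None),
        ("PEOL", re.compile(r"[\r\n]+"), "pop"),     # guard: leave PreProc
        ("PWS", re.compile(r"[ \t]+"), None),
    ],
}
HIDDEN = {"PStart", "NL", "PEOL", "PWS"}  # stand-in for channel(HIDDEN)


def tokenize(text):
    """Return the visible (name, lexeme) pairs for text."""
    stack = ["DEFAULT"]
    pos = 0
    tokens = []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = pattern.match(text, pos)
            # Strictly longer only, so the earlier rule keeps a tie.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, end, action = best
        if name not in HIDDEN:
            tokens.append((name, text[pos:end]))
        if action == "push":
            stack.append("PreProc")  # only one target mode in this sketch
        elif action == "pop":
            stack.pop()
        pos = end
    return tokens


SAMPLE = '\n#IFDEF one\nprint "1"\n#ELSE\nprint "2"\n#ENDIF\n'
for token in tokenize(SAMPLE):
    print(token)
```

For the sample, the visible tokens come out as PIFDEF, PTEXT, TEXT, PELSE, TEXT, PENDIF. One thing the sketch makes visible: because PEOL consumes the newline that PStart needs, two directive lines directly after each other (or a directive on the very first line with no preceding newline) fall through to TEXT. The same may apply to the grammar, depending on how TEXT is defined in your default mode.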
Update - to consolidate the full text of the directives into single tokens:
PIFDEF : 'IFDEF' PTEXT* PEOL -> popMode ;
PELSE : 'ELSE' PEOL -> popMode ;
PENDIF : 'ENDIF' PEOL -> popMode ;
PTEXT : [ \ta-zA-Z0-9_-]+ ;
PEOL : [\r\n] ;
This is not typically the direction you want to go: generally you want more decomposition, not less. For example, the following might be better, and it still produces visible EOLs.
mode PreProc ;
PIFDEF : 'IFDEF' ;
PELSE : 'ELSE' ;
PENDIF : 'ENDIF' ;
PTEXT : [a-zA-Z0-9_-]+ ;
PEOL : '\r'? '\n' -> popMode ;
PWS : [ \t]+ -> channel(HIDDEN) ;
PCMT : '//' ~[\r\n]* -> channel(HIDDEN) ;
This way the preproc command tokens are discrete, and a sequence of one or more PTEXTs contains only the preproc identifier. Emitting PEOLs appears redundant, but is not necessarily wrong. Parser rules to demonstrate:
// else is a keyword in most ANTLR target languages, so the rule is named else_
preproc : ifdef | else_ | endif ;
ifdef : PIFDEF PTEXT+ PEOL ; // the rules are unambiguous
else_ : PELSE PEOL ; // even without matching the PEOLs
endif : PENDIF PEOL ;
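As a rough illustration of how those token sequences can be consumed, here is a small recursive-descent sketch in Python that walks a list of (name, lexeme) pairs shaped like the tokens above and keeps only the active branch. Everything here is hypothetical and not something ANTLR generates: the `preprocess` function, the `defined` set, the rule that the first PTEXT names the symbol, and the mandatory ELSE branch (taken from the question's preprocessor_statement rule).

```python
def preprocess(tokens, defined):
    """Return the TEXT lexemes that survive IFDEF/ELSE/ENDIF resolution."""
    pos = 0
    kept = []

    def peek():
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def expect(kind):
        nonlocal pos
        if peek() != kind:
            raise ValueError(f"expected {kind}, got {peek()} at token {pos}")
        lexeme = tokens[pos][1]
        pos += 1
        return lexeme

    def code_block(active):
        # Zero or more TEXT tokens; kept only when the branch is active.
        while peek() == "TEXT":
            line = expect("TEXT")
            if active:
                kept.append(line)

    def preproc_statement():
        # PIFDEF PTEXT+ PEOL
        expect("PIFDEF")
        names = [expect("PTEXT")]
        while peek() == "PTEXT":
            names.append(expect("PTEXT"))
        expect("PEOL")
        taken = names[0] in defined  # assumption: first PTEXT is the symbol
        code_block(taken)
        # PELSE PEOL
        expect("PELSE")
        expect("PEOL")
        code_block(not taken)
        # PENDIF PEOL
        expect("PENDIF")
        expect("PEOL")

    while peek() != "EOF":
        if peek() == "PIFDEF":
            preproc_statement()
        elif peek() == "TEXT":
            code_block(True)
        else:
            raise ValueError(f"unexpected {peek()} at token {pos}")
    return kept


TOKENS = [
    ("PIFDEF", "IFDEF"), ("PTEXT", "one"), ("PEOL", "\n"),
    ("TEXT", 'print "1"'),
    ("PELSE", "ELSE"), ("PEOL", "\n"),
    ("TEXT", 'print "2"'),
    ("PENDIF", "ENDIF"), ("PEOL", "\n"),
]
print(preprocess(TOKENS, {"one"}))
print(preprocess(TOKENS, set()))
```

With `one` defined, only the first print line is kept; with nothing defined, only the second. In a real ANTLR setup you would more likely do this in a listener or visitor over the parse tree rather than by hand.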
Upvotes: 3