Reputation: 300
I am trying to write a parser for a relatively simple but idiosyncratic language.
Simply put, one of the rules is that comment lines are denoted by an asterisk only if that asterisk is the first character of the line. How might I go about formalising such a rule in ANTLR4? I thought about using:
START_LINE_COMMENT: '\n*' .*? '\n' -> skip;
But I am certain this won't work with more than one comment line in a row: the newline at the end will be consumed as part of the START_LINE_COMMENT token, so any subsequent comment line will be missing its required initial newline. Is there a way I can check whether a line starts with a '*' without needing to consume the prior '\n'?
Upvotes: 3
Views: 1797
Reputation: 3734
Matching a comment line is not easy. As I write only about one grammar per year, I had to reach for The Definitive ANTLR Reference to refresh my memory. Try this:
grammar Question;

/* Comment line having an * in column 1. */

question
    : line+
    ;

line
//  : ( ID | INT )+
    : ( ID | INT | MULT )+
    ;

// The predicate runs after '*' has been consumed, so a position of 1
// means the asterisk itself was the first character of the line.
LINE_COMMENT
    : '*' {getCharPositionInLine() == 1}? ~[\r\n]* -> channel(HIDDEN) ;

ID  : [a-zA-Z]+ ;
INT : [0-9]+ ;
//WS : [ \t\r\n]+ -> channel(HIDDEN) ;
WS  : [ \t\r\n]+ -> skip ;
MULT : '*' ;
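To make the role of the predicate concrete, here is a minimal plain-Java sketch of the same column test. It is not ANTLR-generated code and does not use the ANTLR runtime; the class name ColumnZeroComments and the method findComments are made up for illustration. The `column` variable stands in for getCharPositionInLine(), and the sketch assumes lines end in '\n' (optionally preceded by '\r').

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnZeroComments {

    /** Collects every run that starts with '*' in the first column of a line. */
    public static List<String> findComments(String input) {
        List<String> comments = new ArrayList<>();
        int column = 0; // stand-in for getCharPositionInLine()
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\n') {
                // A newline resets the column; nothing is "borrowed" from the next line.
                column = 0;
                i++;
            } else if (c == '*' && column == 0) {
                // Column 0 before the '*' is the same condition as
                // column 1 after it, which is what the grammar tests.
                int end = i;
                while (end < input.length()
                        && input.charAt(end) != '\r'
                        && input.charAt(end) != '\n') {
                    end++; // mirrors ~[\r\n]*: stop before the line terminator
                }
                comments.add(input.substring(i, end));
                column += end - i;
                i = end;
            } else {
                // Any other character, including a '*' further along the line.
                column++;
                i++;
            }
        }
        return comments;
    }

    public static void main(String[] args) {
        String data = String.join("\n",
                "line 1",
                "* comment 1",
                "*comment 2",
                "",
                "* comment 3 after empty line",
                " * not a comment",
                "line 9 * no comment") + "\n";
        for (String comment : findComments(data)) {
            System.out.println(comment);
        }
    }
}
```

Because the newline is never part of the match, consecutive comment lines are each recognised on their own, which is the property the '\n*' rule in the question could not provide.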
Compile and execute:
$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar:
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens data.txt
[@0,0:3='line',<ID>,1:0]
[@1,5:5='1',<INT>,1:5]
[@2,9:12='line',<ID>,2:2]
[@3,14:14='2',<INT>,2:7]
[@4,16:26='* comment 1',<LINE_COMMENT>,channel=1,3:0]
[@5,32:35='line',<ID>,4:4]
[@6,37:37='4',<INT>,4:9]
[@7,39:48='*comment 2',<LINE_COMMENT>,channel=1,5:0]
[@8,51:78='* comment 3 after empty line',<LINE_COMMENT>,channel=1,7:0]
[@9,81:81='*',<'*'>,8:1]
[@10,83:85='not',<ID>,8:3]
[@11,87:87='a',<ID>,8:7]
[@12,89:95='comment',<ID>,8:9]
[@13,97:100='line',<ID>,9:0]
[@14,102:102='9',<INT>,9:5]
[@15,107:107='*',<'*'>,9:10]
[@16,109:110='no',<ID>,9:12]
[@17,112:118='comment',<ID>,9:15]
[@18,120:119='<EOF>',<EOF>,10:0]
with the following data.txt file:
line 1
  line 2
* comment 1
    line 4
*comment 2

* comment 3 after empty line
 * not a comment
line 9    * no comment
Note that without the MULT token, or a literal '*' somewhere in a parser rule, the asterisk is not listed in the tokens, and the lexer complains:
line 8:1 token recognition error at: '*'
If you display the parse tree
$ grun Question question -gui data.txt
you'll see that the whole file is absorbed by a single line rule. If you need to recognize individual lines, change the line and whitespace rules like so:
line
    : ( ID | INT | MULT )+ NL
    | NL
    ;

//WS : [ \t\r\n]+ -> skip ;
NL  : [\r\n] ;
WS  : [ \t]+ -> skip ;
Upvotes: 3