rooms

Reputation: 300

How can I check if first character of a line is "*" in ANTLR4?

I am trying to write a parser for a relatively simple but idiosyncratic language.

Simply put, one of the rules is that comment lines are denoted by an asterisk only if that asterisk is the first character of the line. How might I go about formalising such a rule in ANTLR4? I thought about using:

START_LINE_COMMENT: '\n*' .*? '\n' -> skip; 

But I am certain this won't work with more than one comment line in a row: the newline at the end will be consumed as part of the START_LINE_COMMENT token, so any subsequent comment line will be missing its required initial newline character. Is there a way to check whether a line starts with a '*' without consuming the preceding '\n'?

Upvotes: 3

Views: 1797

Answers (1)

BernardK

Reputation: 3734

Matching a comment line is not easy. Since I write about one grammar per year, I had to reach for The Definitive ANTLR Reference to refresh my memory. Try this:

grammar Question;

/* Comment line having an * in column 1. */

question
    :   line+
    ;

line
//    :   ( ID | INT )+
    :   ( ID | INT | MULT )+
    ;

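// The predicate is evaluated after '*' has been consumed, so a position of 1
// means the asterisk itself was the first character of the line.
LINE_COMMENT
    :   '*' {getCharPositionInLine() == 1}? ~[\r\n]* -> channel(HIDDEN) ;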
ID  :   [a-zA-Z]+ ;
INT :   [0-9]+ ;
//WS  :   [ \t\r\n]+ -> channel(HIDDEN) ;
WS  :   [ \t\r\n]+ -> skip ;
MULT : '*' ;

Compile and execute:

$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar:
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4 
$ javac Q*.java
$ grun Question question -tokens data.txt 
[@0,0:3='line',<ID>,1:0]
[@1,5:5='1',<INT>,1:5]
[@2,9:12='line',<ID>,2:2]
[@3,14:14='2',<INT>,2:7]
[@4,16:26='* comment 1',<LINE_COMMENT>,channel=1,3:0]
[@5,32:35='line',<ID>,4:4]
[@6,37:37='4',<INT>,4:9]
[@7,39:48='*comment 2',<LINE_COMMENT>,channel=1,5:0]
[@8,51:78='* comment 3 after empty line',<LINE_COMMENT>,channel=1,7:0]
[@9,81:81='*',<'*'>,8:1]
[@10,83:85='not',<ID>,8:3]
[@11,87:87='a',<ID>,8:7]
[@12,89:95='comment',<ID>,8:9]
[@13,97:100='line',<ID>,9:0]
[@14,102:102='9',<INT>,9:5]
[@15,107:107='*',<'*'>,9:10]
[@16,109:110='no',<ID>,9:12]
[@17,112:118='comment',<ID>,9:15]
[@18,120:119='<EOF>',<EOF>,10:0]

with the following data.txt file:

line 1
        line 2
* comment 1
    line 4
*comment 2

* comment 3 after empty line
 * not a comment
line 9    * no comment

Note that without the MULT token, or a literal '*' somewhere in a parser rule, an asterisk that is not the first character of its line matches no lexer rule. It does not appear in the token list, and the lexer reports:

line 8:1 token recognition error at: '*'

If you display the parse tree:

$ grun Question question -gui data.txt

you'll see that the whole file is absorbed by a single line rule, because newlines are skipped as white space. If you need to recognize individual lines, change the line and white space rules like so:

line
    :   ( ID | INT | MULT )+ NL
    |   NL
    ;

//WS  :   [ \t\r\n]+ -> skip ;
NL  :   [\r\n] ;
WS  :   [ \t]+ -> skip ;
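
To make the column test in LINE_COMMENT easier to reason about, here is a minimal plain-Java sketch of the same idea, written without ANTLR. The class ColumnOneCheck and the method commentLines are hypothetical names used only for illustration; they are not part of ANTLR or of the generated lexer. The sketch assumes the column counter behaves like the lexer's: it counts characters consumed on the current line and resets to 0 after each '\n'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: mimics the column test of the
// LINE_COMMENT predicate with a hand-written scan.
class ColumnOneCheck {

    /** Returns every line of text whose very first character is '*'. */
    static List<String> commentLines(String text) {
        List<String> comments = new ArrayList<>();
        int column = 0; // characters consumed so far on the current line
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i++);
            if (c == '\n') {
                column = 0; // a new line starts after the newline
                continue;
            }
            column++;
            // Same test as the predicate: after consuming '*', the
            // position is 1 only if the '*' was in the first column.
            if (c == '*' && column == 1) {
                int start = i - 1;
                // Like ~[\r\n]*: take everything up to the line end.
                while (i < text.length()
                        && text.charAt(i) != '\r'
                        && text.charAt(i) != '\n') {
                    i++;
                    column++;
                }
                comments.add(text.substring(start, i));
            }
        }
        return comments;
    }

    public static void main(String[] args) {
        String data = "line 1\n* comment 1\n * not a comment\nline 9    * no comment\n";
        System.out.println(commentLines(data)); // prints [* comment 1]
    }
}
```

Only the asterisk in the first column starts a comment; the indented asterisk and the one in the middle of a line are left alone, which is what the MULT rule picks up in the grammar.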

Upvotes: 3
