ANTLR Parsing - Ignoring comments on last line of input

Question

I'm using ANTLR 3.2 and I'm having trouble writing a grammar that ignores comment lines. Specifically, I get an error if a comment line is the last line of the input and has no newline after it.

My input is effectively assembly language, where comments start anywhere on the line with a semi-colon, and go to end of line. Everything else is parsed as commands.

A cut-down version of my grammar that exhibits the problem is:

grammar Test;

options {
    language = Java;
    output = AST;
    ASTLabelType = CommonTree;
}

@header {
    package test;
}

@lexer::header {
    package test;
}

rule
    :   instruction+ EOF!
    ;

instruction
    :   'SET' NEWLINE!*
    ;

COMMENT
    :   ';' .* NEWLINE+ { $channel=HIDDEN; }
    ;

NEWLINE
    :  '\r'? '\n'
    ;

WS
    :   (' ' | '\r' | '\n' | '\t' | '\f')+ { $channel = HIDDEN; }
    ;

If I use an input like:

; comment line 1 with blank line after it

SET ; comment after command
; comment line again

I get an error when parsing this, saying: line 4:11 required (...)+ loop did not match anything at character '<EOF>'.

If I add a newline to the last line of the input, it works fine as the newline is matched by the comment stripping, and the EOF matches at end of rule.

How can I write this better so that it ignores comments on the final line without giving an error? I don't want to hack it by appending anything to the original input. Is there a cleaner way to read comment lines? I've tried all kinds of combinations of NEWLINE|EOF, but nothing gets rid of the error.

Bart Kiers · Accepted Answer

Something like this should do it:

COMMENT
    :   ';' ~('\r' | '\n')* { $channel=HIDDEN; }
    ;

And if you want a COMMENT to potentially have a line break at the end, do:

COMMENT
    :   ';' ~('\r' | '\n')* NEWLINE? { $channel=HIDDEN; }
    ;
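
To see why making the line break optional matters, here is a rough regex analogue in plain Java. It is only a sketch: the class name `CommentRuleSketch` and both patterns are illustrative assumptions that approximate the lexer rules, not anything ANTLR generates. The first pattern mimics the original `';' .* NEWLINE+` (a line break is mandatory), the second mimics `';' ~('\r' | '\n')* NEWLINE?`.

```java
import java.util.regex.Pattern;

final class CommentRuleSketch {
    // Approximates: ';' .* NEWLINE+   (at least one line break is required)
    static final Pattern NEEDS_NEWLINE =
        Pattern.compile(";.*?(?:\\r?\\n)+", Pattern.DOTALL);

    // Approximates: ';' ~('\r' | '\n')* NEWLINE?   (line break is optional)
    static final Pattern NEWLINE_OPTIONAL =
        Pattern.compile(";[^\\r\\n]*(?:\\r?\\n)?");

    /** Returns true if the pattern matches a comment starting at offset 0. */
    static boolean matchesAtStart(Pattern p, String input) {
        return p.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        // Last line of the input, with no trailing line break.
        String lastLine = "; comment line again";
        System.out.println(matchesAtStart(NEEDS_NEWLINE, lastLine));    // false
        System.out.println(matchesAtStart(NEWLINE_OPTIONAL, lastLine)); // true
    }
}
```

The first pattern fails on the final line for the same reason the original rule does: it insists on a line break that the input never supplies.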

However, the two rules NEWLINE and WS:

NEWLINE
    :  '\r'? '\n'
    ;

WS
    :   (' ' | '\r' | '\n' | '\t' | '\f')+ { $channel = HIDDEN; }
    ;

are dangerous. ANTLR's lexer tries to match as much input as possible, so the rule that matches the most characters "wins". If two (or more) rules match the same number of characters, the one defined first "wins".

In other words, if the lexer sees input like "\n", a NEWLINE token is created. But if the lexer sees " \n" (a space followed by a "\n"), a WS token is created (and put on the HIDDEN channel).
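
The longest-match and first-defined-wins behaviour described above can be modelled with a few lines of plain Java. This is a simplified sketch of that description, not ANTLR's actual implementation; the class `LongestMatchSketch` and the method `firstToken` are hypothetical names, and the two regexes only approximate the NEWLINE and WS rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestMatchSketch {
    // Rules kept in definition order, mirroring NEWLINE and WS in the grammar.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NEWLINE", Pattern.compile("\\r?\\n"));
        RULES.put("WS", Pattern.compile("[ \\r\\n\\t\\f]+"));
    }

    /** Name of the rule that would produce the first token, or null if none matches. */
    static String firstToken(String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Strictly greater: on a tie, the rule defined earlier keeps its win.
            if (m.lookingAt() && m.end() > longest) {
                longest = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        System.out.println(firstToken("\n"));   // NEWLINE (tie, defined first)
        System.out.println(firstToken(" \n"));  // WS (longer match)
    }
}
```

Under this model, two consecutive line breaks ("\n\n") would also come out as a single hidden WS token rather than two NEWLINE tokens, which is the kind of surprise the overlap between the two rules invites.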

I'm not sure if line breaks really are significant in your language (they're not in any flavor of assembly language, AFAIK). If they aren't, simply remove the NEWLINE rule. If they are significant, remove both the chars '\r' and '\n' from the WS rule.
