ANTLR proper ordering of grammar rules

Question

I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.

Here is my grammar:

grammar test;

doc: item+ ;
item: func | atom ;

func: '<<' WORD '>>' ;
atom: PUNCT+            #punctAtom
    | NEWLINE+          #newlineAtom
    | WORD              #wordAtom
    ;

WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;

fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;

So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.

When I test my grammar with <word>, it treats it as an atom, as expected. However, when I give it <<word>>, it treats it as an atom as well.

Is there something I'm missing?

PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).

PPS - One thing I've tried is adding < and > to the last rule, which is the list of symbols that I'm "disallowing" inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new one, because <word> is no longer accepted as an atom.

Bart Kiers · Accepted Answer

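ANTLR's lexer tries to match as many characters as possible, so both <<word>> and <word> are matched in their entirety by the lexer rule WORD. Therefore, in these cases the tokens << and >> (or < and > for that matter) are never created, and the parser never gets the chance to choose func, regardless of the order of the parser rules.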

You can see what tokens are being created by running these lines of code:

// requires: import org.antlr.v4.runtime.*;
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill(); // read all tokens up to and including EOF

for (Token t : tokens.getTokens()) {
  System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}

which will print:

WORD                 <word>
WORD                 <<word>>
EOF                  <EOF>
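The longest-match rule behind this result can be sketched without ANTLR. The class below is only an illustration, not how ANTLR is implemented: the names MaximalMunchDemo, RULES and tokenize are hypothetical, and the regular expressions are hand-written approximations of the lexer rules from the question (WORD ends up accepting every character except space, tab, line breaks, |, { and }). At each position, every rule is tried and the longest match wins; on a tie, the rule defined first wins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaximalMunchDemo {
    // Hypothetical stand-ins for the question's token rules, in definition order.
    // The literals '<<' and '>>' come first, as implicit tokens from the parser rules.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("'<<'", Pattern.compile("<<"));
        RULES.put("'>>'", Pattern.compile(">>"));
        RULES.put("WS", Pattern.compile("[ \\t]"));
        RULES.put("NEWLINE", Pattern.compile("[\\r\\n]+"));
        RULES.put("PUNCT", Pattern.compile("[.,?!]+"));
        // Assumption: CHAR+ expanded by hand; < and > are allowed inside a WORD.
        RULES.put("WORD", Pattern.compile("[^ |{}\\n\\r\\t]+"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer matches replace the current best, so ties go to the earlier rule.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestName.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestName + " " + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("<word> <<word>>")) {
            System.out.println(token);
        }
    }
}
```

Running main prints "WORD <word>" and "WORD <<word>>": at the start of <<word>>, the literal '<<' matches two characters but WORD matches all eight, so WORD wins.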

What you could do is something like this:

func
 : '<<' WORD '>>' 
 ;

atom
 : PUNCT+   #punctAtom
 | NEWLINE+ #newlineAtom
 | word     #wordAtom
 ;

word
 : WORD
 | '<' WORD '>'
 ;

...

fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;

Of course, input in which < or > appears inside a word, such as foo<bar, will no longer become a single WORD, which it previously would.
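To make this trade-off concrete, here is a rough stand-alone simulation of the revised token rules. It is a sketch under assumptions, not ANTLR itself: RevisedRulesDemo, NAMES, PATTERNS and tokenize are hypothetical names, and the regular expressions are hand-written approximations in which WORD no longer accepts < or >. The longest match wins at each position, and ties go to the rule listed first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RevisedRulesDemo {
    // Hypothetical stand-ins for the revised token rules, in definition order.
    static final String[] NAMES = {"'<<'", "'>>'", "'<'", "'>'", "WS", "NEWLINE", "PUNCT", "WORD"};
    static final Pattern[] PATTERNS = {
        Pattern.compile("<<"), Pattern.compile(">>"),
        Pattern.compile("<"), Pattern.compile(">"),
        Pattern.compile("[ \\t]"), Pattern.compile("[\\r\\n]+"),
        Pattern.compile("[.,?!]+"),
        // Assumption: CHAR+ expanded by hand, with < and > now excluded from WORD.
        Pattern.compile("[^<> |{}\\n\\r\\t]+")
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            int bestEnd = pos;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input).region(pos, input.length());
                // Only a strictly longer match replaces the current best.
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = i;
                    bestEnd = m.end();
                }
            }
            if (best < 0) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!NAMES[best].equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(NAMES[best] + " " + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<<word>>"));
        System.out.println(tokenize("<word>"));
        System.out.println(tokenize("foo<bar"));
    }
}
```

In this sketch, <<word>> yields the three tokens '<<', WORD and '>>' that func expects, <word> yields '<', WORD and '>' for the new word rule, and foo<bar is split into WORD, '<' and WORD instead of one WORD.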
