Reputation: 695
I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.
Here is my grammar:
grammar test;
doc: item+ ;
item: func | atom ;
func: '<<' WORD '>>' ;
atom: PUNCT+    #punctAtom
    | NEWLINE+  #newlineAtom
    | WORD      #wordAtom
    ;
WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;
fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;
So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.
When I test my grammar with <word>, it treats it as an atom, as expected. However, when I test my grammar and give it <<word>>, it treats it as an atom as well.
Is there something I'm missing?
PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them the labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).
PPS - One thing I've tried is adding < and > to the negated set in the last rule (SYMB), which lists the characters I'm "disallowing" inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new problem because <word> is no longer accepted as an atom.
Upvotes: 1
Views: 455
Reputation: 170158
ANTLR's lexer tries to match as many characters as possible. Because SYMB is a negated set that does not exclude < and >, both <<word>> and <word> are matched in their entirety by the lexer rule WORD. Therefore, in these cases the tokens << and >> (or < and > for that matter) will not be created.
You can see what tokens are being created by running these lines of code:
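import org.antlr.v4.runtime.*;

// testLexer is the lexer class ANTLR generates from "grammar test;"
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill(); // run the lexer to EOF so every token is buffered
for (Token t : tokens.getTokens()) {
    System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}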
which will print:
WORD                 <word>
WORD                 <<word>>
EOF                  <EOF>
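If you want to see the longest-match rule at work without generating the lexer, the following is a rough plain-Java approximation. It does not use ANTLR: the class name LongestMatchSketch and the regular expressions are my own stand-ins for the lexer rules in the question (WORD is CHAR+ collapsed into a single character class), so treat it as an illustration under those assumptions rather than as ANTLR's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: imitates "longest match wins,
// ties go to the rule defined first" with java.util.regex.
final class LongestMatchSketch {

    // Regex stand-ins for the question's rules, in definition order.
    // The implicit literals '<<' and '>>' from the parser rule come first.
    static Map<String, Pattern> originalRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("'<<'", Pattern.compile("<<"));
        rules.put("'>>'", Pattern.compile(">>"));
        rules.put("WS", Pattern.compile("[ \\t]"));
        rules.put("NEWLINE", Pattern.compile("[\\n\\r]+"));
        rules.put("PUNCT", Pattern.compile("[.,?!]+"));
        // LETTER | DIGIT | SYMB | PUNCT together cover every character
        // except space, '|', '{', '}', '\n', '\r' and '\t'.
        rules.put("WORD", Pattern.compile("[^ |{}\\n\\r\\t]+"));
        return rules;
    }

    static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strict '>' keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                pos++; // simplification: silently skip a character no rule matches
                continue;
            }
            if (!bestName.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestName + " " + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("<word> <<word>>", originalRules()).forEach(System.out::println);
    }
}
```

At the start of <<word>> the literal '<<' matches only two characters while WORD matches all eight, so WORD wins and the parser never sees a '<<' token.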
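What you could do is exclude < and > from SYMB, so they become tokens of their own, and let a parser rule put <word> back together. Something like this: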
func
 : '<<' WORD '>>'
 ;
atom
 : PUNCT+   #punctAtom
 | NEWLINE+ #newlineAtom
 | word     #wordAtom
 ;
word
 : WORD
 | '<' WORD '>'
 ;
...
fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;
Of course, something like foo<bar will not become a single WORD, which it previously would.
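To make that trade-off concrete, here is a small plain-Java sketch of how the revised rules would split a few inputs. It is only an approximation and does not call ANTLR: the class name RevisedRulesSketch is hypothetical, and the regular expressions are my own translation of the lexer rules, with < and > removed from WORD and the literals '<<', '>>', '<' and '>' added as separate rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: longest match wins,
// ties go to the rule listed first.
final class RevisedRulesSketch {

    // Rule names and regex stand-ins, in definition order (implicit literals first).
    static final String[] NAMES = {"'<<'", "'>>'", "'<'", "'>'", "WS", "NEWLINE", "PUNCT", "WORD"};
    static final Pattern[] RULES = {
        Pattern.compile("<<"), Pattern.compile(">>"),
        Pattern.compile("<"), Pattern.compile(">"),
        Pattern.compile("[ \\t]"), Pattern.compile("[\\n\\r]+"),
        Pattern.compile("[.,?!]+"),
        Pattern.compile("[^<> |{}\\n\\r\\t]+") // WORD no longer accepts '<' or '>'
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            int bestEnd = pos;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = i;
                    bestEnd = m.end();
                }
            }
            if (best < 0) {
                pos++; // simplification: silently skip a character no rule matches
                continue;
            }
            if (!NAMES[best].equals("WS")) {
                tokens.add(NAMES[best] + " " + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"<word>", "<<word>>", "foo<bar"}) {
            System.out.println(s + " -> " + tokenize(s));
        }
    }
}
```

Under these assumptions <<word>> splits into '<<', WORD, '>>' (which func can match), <word> splits into '<', WORD, '>' (which the word rule accepts), and foo<bar splits into WORD, '<', WORD, so it no longer forms a single word.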
Upvotes: 2