wishi

Reputation: 7387

ANTLR 3 - how do I make unique tokens with NOT across special chars

I have a short question:

// Lexer 
LOOP_NAME   :   (LETTER|DIGIT)+;

OTHERCHARS  :   ~('>' | '}')+;

LETTER      :   ('A'..'Z')|('a'..'z');

DIGIT       :   ('0'..'9');

A_ELEMENT
    :       (LETTER|'_')*(LETTER|DIGIT|'_'|'.');

// Parser configuration
WS          :   ( ' '
                | '\t'
                | '\r'
                | '\n'
                ) {$channel=HIDDEN;}
            ;

My problem is that ANTLR rejects this grammar with:

As a result, alternative(s) 2 were disabled for that input
[14:55:32] error(208): ltxt2.g:61:1: The following token definitions can never be matched because prior tokens match the same input: LETTER,DIGIT,A_ELEMENT,WS

My issue is that I also need to catch UTF-8 characters with OTHERCHARS, and I cannot list every special UTF-8 character in a lexer rule, since I cannot write a range like ("!".."?").

So I need the NOT operator (~). OTHERCHARS can be anything except ">" or "}". These two characters close a literal context and are forbidden inside it.

Such cases don't seem to be documented very well, so I'd be happy if someone knows a workaround. The NOT operator here creates the ambiguity I need to resolve.

Thanks in advance.

Best, wishi

Upvotes: 0

Views: 312

Answers (1)

Sam Harwell

Reputation: 100029

Move OTHERCHARS to the very end of the lexer and define it like this:

OTHERCHARS : . ;

In the Java target, this will match a single UTF-16 code unit which is not matched by a previous rule. I typically name the rule ANY_CHAR and treat it as a fall-back. Because . consumes exactly one character where .+ would consume as many as possible, the lexer will only use this rule if no other rule matches.

  1. If another rule matches more than one character, that rule will have priority over ANY_CHAR due to matching a larger number of characters from the input.
  2. If another rule matches exactly one character, that rule will have priority over ANY_CHAR due to appearing earlier in the grammar.

Edit: To exclude } and > from the ANY_CHAR rule, you'll want to create rules for them so they are covered under point 2.

RBRACE   : '}' ;
GT       : '>' ;
ANY_CHAR : . ;
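
To make the two priority points concrete, here is a small toy model in plain Java. It is not the code ANTLR generates and it does not use the ANTLR runtime; it only imitates "longest match wins, earlier rule wins a tie". The class name FallbackLexerSketch, the tokenize method, and the regex approximations of the rules are my own assumptions for illustration, and A_ELEMENT is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model of the fall-back idea (hypothetical names, not ANTLR output).
public class FallbackLexerSketch {
    // Rules in grammar order; ANY_CHAR comes last as the fall-back.
    private static final String[] NAMES = {"LOOP_NAME", "WS", "RBRACE", "GT", "ANY_CHAR"};
    private static final Pattern[] PATTERNS = {
        Pattern.compile("[A-Za-z0-9]+"),       // (LETTER|DIGIT)+
        Pattern.compile("[ \\t\\r\\n]"),       // WS, dropped below like a hidden channel
        Pattern.compile("\\}"),                // RBRACE
        Pattern.compile(">"),                  // GT
        Pattern.compile(".", Pattern.DOTALL)   // ANY_CHAR: one character, including newlines
    };

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestRule = -1;
            int bestLength = 0;
            for (int i = 0; i < PATTERNS.length; i++) {
                Matcher m = PATTERNS[i].matcher(input);
                m.region(pos, input.length());
                // Strictly greater: a longer match wins (point 1), and on a tie
                // the rule that appears earlier is kept (point 2).
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestRule = i;
                    bestLength = m.end() - pos;
                }
            }
            // ANY_CHAR matches any single character, so a rule is always found here.
            String text = input.substring(pos, pos + bestLength);
            if (!NAMES[bestRule].equals("WS")) {
                tokens.add(NAMES[bestRule] + ":" + text);
            }
            pos += bestLength;
        }
        return tokens;
    }
}
```

With this model, FallbackLexerSketch.tokenize("ab1 \u00e4}>") returns [LOOP_NAME:ab1, ANY_CHAR:ä, RBRACE:}, GT:>]. The letters and digits go to the longer LOOP_NAME match, the non-ASCII character falls through to ANY_CHAR, and } and > go to their own rules because those rules appear before ANY_CHAR.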

Upvotes: 1
