ANTLR not matching unicode escaped character

Question

I'm writing a parser/interpreter for a C-like language and I need to interpret escaped characters. One of them is the Unicode escape sequence, which has the pattern "\uXXXX", where each X is a hex digit.

My ANTLR rules look like this:

public char returns [char c] 
    : '\\"' { $c = '"'; } 
    | '\\\\' { $c = '\\'; }
    | '\\/' { $c = '/'; }
    | '\\b' { $c = '\b'; }
    | '\\f' { $c = '\f'; }
    | '\\n' { $c = '\n'; }
    | '\\r' { $c = '\r'; }
    | '\\t' { $c = '\t'; }
    | '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT { $c = 'e'; }
    | ~('\\' | '"') { $c = '/'; }
    ;

fragment HEXDIGIT
    : ('0'..'9'|'a'..'f'|'A'..'F')
    ;

I'm feeding it the string "\u1234", for which I expect an 'e', but I'm getting a '/' instead, which is the fallback alternative for everything else.

Is there some magic juju going on with fragments and rules or something that I'm not aware of?

Bart Kiers · Accepted Answer

As Adam mentioned, char is currently a parser rule, but it should be a lexer rule: HEXDIGIT is a fragment, and fragment rules never produce tokens of their own, so they can only be used from other lexer rules. A lexer rule cannot return a char, though, because lexer rules always produce a Token.

You can change the text of a token with the setText(...) method, like this (assuming Java is the target language):

// lexer rules start with a capital!
Char
  :  '\\"'                                    { setText("\""); } 
  |  '\\\\'                                   { setText("\\"); } 
  |  '\\/'                                    { setText("/"); } 
  |  '\\b'                                    { setText("\b"); } 
  |  '\\f'                                    { setText("\f"); } 
  |  '\\n'                                    { setText("\n"); } 
  |  '\\r'                                    { setText("\r"); } 
  |  '\\t'                                    { setText("\t"); } 
  |  '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT 
     { 
       String hex = getText();
       int i = Integer.parseInt(hex.substring(2), 16);
       setText(hex + " base 10 = " + i);
     } 
  |  ~('\\' | '"')
  ;
  ;

fragment HEXDIGIT
  :  ('0'..'9'|'a'..'f'|'A'..'F')
  ;
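
The Unicode alternative in the Char rule only rewrites the token text to a diagnostic string (for the input \u1234 it becomes "\u1234 base 10 = 4660"). If the goal is the character itself, the parsed integer can be cast to a char. The following is a minimal Java sketch of that step and is not part of the original answer. The class name EscapeDecoder and the method decode are hypothetical, and the code assumes the token text has exactly the shape that the Char rule matches.

```java
// Hypothetical helper (not from the original answer): turns the text of one
// Char token into the single character that it stands for.
final class EscapeDecoder {

    private EscapeDecoder() {
        // static helper only, no instances
    }

    static char decode(String tokenText) {
        // Anything that does not start with a backslash is an ordinary character.
        if (tokenText.charAt(0) != '\\') {
            return tokenText.charAt(0);
        }
        // The character after the backslash selects the kind of escape.
        char kind = tokenText.charAt(1);
        switch (kind) {
            case '"':  return '"';
            case '\\': return '\\';
            case '/':  return '/';
            case 'b':  return '\b';
            case 'f':  return '\f';
            case 'n':  return '\n';
            case 'r':  return '\r';
            case 't':  return '\t';
            case 'u':
                // Skip the backslash and the 'u', then parse the hex digits
                // and cast the resulting code unit to a char.
                return (char) Integer.parseInt(tokenText.substring(2), 16);
            default:
                throw new IllegalArgumentException("Unknown escape: " + tokenText);
        }
    }
}
```

Inside the lexer action, a call such as setText(String.valueOf(EscapeDecoder.decode(getText()))) would then replace the token text with the decoded character, so that \u1234 yields the single character U+1234.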
