Pieter Breed

Reputation: 5689

ANTLR not matching unicode escaped character

I'm writing a parser/interpreter for a C-like language and I need to interpret escaped characters. One of them is the Unicode escape sequence, which has the pattern "\uXXXX", where each X is a hex digit.

My ANTLR rules look like this:

public char returns [char c] 
    : '\\"' { $c = '"'; } 
    | '\\\\' { $c = '\\'; }
    | '\\/' { $c = '/'; }
    | '\\b' { $c = '\b'; }
    | '\\f' { $c = '\f'; }
    | '\\n' { $c = '\n'; }
    | '\\r' { $c = '\r'; }
    | '\\t' { $c = '\t'; }
    | '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT { $c = 'e'; }
    | ~('\\' | '"') { $c = '/'; }
    ;

fragment HEXDIGIT
    : ('0'..'9'|'a'..'f'|'A'..'F')
    ;

I'm feeding it the string "\u1234", for which I expect an 'e', but I'm getting a '/' instead, which is what the fallback alternative for everything else returns.

Is there some magic juju going on with fragments and rules or something that I'm not aware of?

Upvotes: 0

Views: 402

Answers (1)

Bart Kiers

Reputation: 170308

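As mentioned by Adam, char is currently a parser rule, but it should be a lexer rule instead. A fragment such as HEXDIGIT can only be referenced from other lexer rules and never produces a token of its own, so a parser rule can never match it. Once char is a lexer rule, you can't let it return a char (lexer rules always return an instance of a Token!).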

You can adjust the text of a token by calling its setText(...) method inside the rule's action, like this (assuming Java is the target language):

// lexer rules start with a capital!
Char
  :  '\\"'                                     { setText("\""); } 
  |  '\\\\'                                    { setText("\\"); } 
  |  '\\/'                                     { setText("/"); } 
  |  '\\b'                                     { setText("\b"); } 
  |  '\\f'                                     { setText("\f"); } 
  |  '\\n'                                     { setText("\n"); } 
  |  '\\r'                                     { setText("\r"); } 
  |  '\\t'                                     { setText("\t"); } 
  |  '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT 
     { 
       String hex = getText();
       int i = Integer.parseInt(hex.substring(2), 16);
       setText(hex + " base 10 = " + i);
     } 
  |  ~('\\' | '"')
  ;

fragment HEXDIGIT
  :  ('0'..'9'|'a'..'f'|'A'..'F')
  ;
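The \uXXXX alternative above only rewrites the token text to show the parsed number. If you want the token to carry the actual character, the conversion could look roughly like the sketch below. The class name UnicodeEscape and the method decode are made up for illustration and are not part of ANTLR; it assumes the input is exactly the text the lexer matched (a backslash, the letter u, and four hex digits).

```java
public final class UnicodeEscape {
    private UnicodeEscape() {}

    /**
     * Hypothetical helper: converts the text of a matched escape token
     * (a backslash, the letter u, and four hex digits) into the single
     * character it denotes.
     */
    public static char decode(String tokenText) {
        if (tokenText.length() != 6 || !tokenText.startsWith("\\u")) {
            throw new IllegalArgumentException("not a unicode escape: " + tokenText);
        }
        // Skip the two-character prefix and parse the four digits as base 16.
        int codeUnit = Integer.parseInt(tokenText.substring(2), 16);
        return (char) codeUnit;
    }

    public static void main(String[] args) {
        // The Java string literal below holds six characters: \ u 0 0 4 1
        System.out.println(decode("\\u0041")); // prints "A"
    }
}
```

Inside the lexer action, the same idea would be to replace the last setText(...) call with setText(String.valueOf((char) i)), so the token text becomes the decoded character rather than a description of it.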

Upvotes: 1
