Reputation: 5689
I'm writing a parser/interpreter for a C-like language and I need to interpret escaped characters. One of them is the Unicode escape sequence, which has the pattern "\uXXXX", where each X is a hex digit.
My ANTLR rules look like this:
public char returns [char c]
: '\\"' { $c = '"'; }
| '\\\\' { $c = '\\'; }
| '\\/' { $c = '/'; }
| '\\b' { $c = '\b'; }
| '\\f' { $c = '\f'; }
| '\\n' { $c = '\n'; }
| '\\r' { $c = '\r'; }
| '\\t' { $c = '\t'; }
| '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT { $c = 'e'; }
| ~('\\' | '"') { $c = '/'; }
;
fragment HEXDIGIT
: ('0'..'9'|'a'..'f'|'A'..'F')
;
I'm feeding it the string "\u1234", for which I expect an 'e', but I'm getting a '/' instead, which is the result of the fallback alternative for everything else.
Is there some magic juju going on with fragments and rules or something that I'm not aware of?
Upvotes: 0
Views: 402
Reputation: 170308
As mentioned by Adam, char is a parser rule at the moment, but should be made a lexer rule instead, in which case you can't let it return a char (lexer rules always return an instance of a Token!).
You can adjust the inner text of a token using its setText(...) method like this (assuming Java is the target language):
// lexer rules start with a capital!
Char
: '\\"' { setText("\""); }
| '\\\\' { setText("\\"); }
| '\\/' { setText("/"); }
| '\\b' { setText("\b"); }
| '\\f' { setText("\f"); }
| '\\n' { setText("\n"); }
| '\\r' { setText("\r"); }
| '\\t' { setText("\t"); }
| '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT
  {
    String hex = getText();
    int i = Integer.parseInt(hex.substring(2), 16);
    setText(hex + " base 10 = " + i);
  }
| ~('\\' | '"')
;
fragment HEXDIGIT
: ('0'..'9'|'a'..'f'|'A'..'F')
;
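The action for the Unicode alternative above only builds a demonstration string. If the token text should instead hold the decoded character itself, the conversion could look roughly like the following. This is a minimal sketch in plain Java: the class UnicodeEscape and the method decode are hypothetical names (not part of ANTLR), and it assumes the input has already been matched by the lexer as a backslash, a 'u', and exactly four hex digits.

```java
// Hypothetical helper, not part of ANTLR: turns a matched Unicode escape
// (backslash, 'u', four hex digits) into the one-character string it denotes.
final class UnicodeEscape {

    static String decode(String escape) {
        // Assumption: the lexer already guaranteed this shape; the check is defensive.
        if (escape.length() != 6 || !escape.startsWith("\\u")) {
            throw new IllegalArgumentException("not a Unicode escape: " + escape);
        }
        // Skip the backslash and the 'u', then parse the four hex digits.
        int codeUnit = Integer.parseInt(escape.substring(2), 16);
        // Four hex digits always fit in one UTF-16 code unit, so the cast is safe.
        return String.valueOf((char) codeUnit);
    }

    public static void main(String[] args) {
        // The Java literal "\\u0041" is the six characters: backslash, u, 0, 0, 4, 1.
        System.out.println(decode("\\u0041")); // prints A
    }
}
```

With such a helper on the classpath, the action in the lexer rule could presumably be reduced to setText(UnicodeEscape.decode(getText())); so that the token carries the single decoded character.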
Upvotes: 1