Reputation: 11
I'm trying to write an ANTLR 3 grammar for a small DSL with Unicode support (needed for German umlauts, äöüÄÖÜß), but I can't seem to get it to work.
I've written a minimal test grammar that is supposed to match any sequence of Unicode characters, like "xay" (which works just fine) or "xäy" (which doesn't).
Here's the grammar:
grammar X;

@lexer::header {
    import org.antlr.runtime.ANTLRInputStream;
    import org.antlr.runtime.ANTLRStringStream;
    import org.antlr.runtime.CommonTokenStream;
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
}

@lexer::members {
    public static void main(String[] args) throws Exception {
        ANTLRInputStream stream = new ANTLRInputStream(new ByteArrayInputStream("x\u00C4y".getBytes()), "utf-8");
        XLexer lex = new XLexer(stream);
        CommonTokenStream c = new CommonTokenStream(lex);
        XParser p = new XParser(c);
        p.x();
    }
}

x : UTF8+;

UTF8 : ('\u0000'..'\uF8FF');
For "xäx" I'm getting the following error:
line 1:1 mismatched character '?' expecting set null
What am I missing?
Thanks!
Upvotes: 1
Views: 1031
Reputation: 69977
I compiled your grammar (using ANTLR 3.4), and it worked for me without problems. Here is exactly what I did:
$ java -jar antlr-3.4-complete-no-antlrv2.jar X.g
$ javac -cp antlr-3.4-complete-no-antlrv2.jar XLexer.java XParser.java
$ CLASSPATH=$CLASSPATH:./antlr-3.4-complete-no-antlrv2.jar:./XLexer.class:./XParser.class java XLexer
I also inserted some code to print the string to STDOUT before parsing it, and it printed the expected string xÄy.
One idea, though: perhaps your default encoding (which, I think, is specified in the file.encoding property at JVM start-up time) is set to something other than UTF-8. getBytes() without an argument encodes with that default charset, so the bytes may not be valid UTF-8 even though the stream is told to decode them as "utf-8". To test this, try specifying the encoding explicitly in the call to getBytes():
ANTLRInputStream stream = new ANTLRInputStream( new ByteArrayInputStream("x\u00C4y".getBytes("UTF-8")), "utf-8");
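To see the suspected mismatch without involving ANTLR at all, here is a minimal sketch. The class name EncodingMismatchDemo and the helper roundTrip are made up for illustration, and ISO-8859-1 only stands in for whatever non-UTF-8 default your JVM might have. It encodes the test string with a given charset and then decodes the bytes as UTF-8, which is what the getBytes()/ANTLRInputStream pair in your grammar does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the grammar or of ANTLR.
class EncodingMismatchDemo {

    // Encode the text with the given charset, then decode the bytes as UTF-8.
    static String roundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "x\u00C4y";

        // The charset that a bare getBytes() call would use on this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Matching charsets: the string survives the round trip.
        System.out.println("UTF-8 round trip intact: "
                + roundTrip(input, StandardCharsets.UTF_8).equals(input));

        // Assumed mismatch: a lone 0xC4 byte is not valid UTF-8, so the
        // decoder substitutes the replacement character U+FFFD.
        String broken = roundTrip(input, StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 round trip intact: " + broken.equals(input));
        System.out.println("second char: U+"
                + Integer.toHexString(broken.charAt(1)).toUpperCase());
    }
}
```

If the mismatched round trip is what happens on your machine, the lexer would see U+FFFD instead of Ä. That code point lies above '\uF8FF', the upper bound of your UTF8 rule, which would be consistent with the "mismatched character" error at line 1:1.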
Upvotes: 2