fscld

Reputation: 11

ANTLR 3: Unicode characters cause error

I'm trying to write an ANTLR 3 grammar for a small DSL with Unicode support (needed for German umlauts, äöüÄÖÜß), but I can't seem to get it to work.

I've written a minimal test grammar that is supposed to match any sequence of Unicode characters, like "xay" (which works just fine) or "xäy" (which doesn't).

Here's the grammar:

grammar X;

@lexer::header {
import org.antlr.runtime.ANTLRInputStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
}

@lexer::members { 

    public static void main(String[] args) throws Exception {
        ANTLRInputStream stream = new ANTLRInputStream( new ByteArrayInputStream("x\u00C4y".getBytes()), "utf-8");
        XLexer lex = new XLexer(stream);
        CommonTokenStream c = new CommonTokenStream(lex);
        XParser p = new XParser(c);
        p.x();
    }

}

x   :    UTF8+;

UTF8 :  ('\u0000'..'\uF8FF');

For input containing an umlaut, such as the "x\u00C4y" in the code above, I'm getting the following error:

line 1:1 mismatched character '?' expecting set null

What am I missing?

Thanks!

Upvotes: 1

Views: 1031

Answers (1)

jogojapan

Reputation: 69977

I compiled your grammar (using ANTLR 3.4), and it worked for me without problems. Here is exactly what I did:

$ java -jar antlr-3.4-complete-no-antlrv2.jar X.g
$ javac -cp antlr-3.4-complete-no-antlrv2.jar XLexer.java XParser.java
$ CLASSPATH=$CLASSPATH:./antlr-3.4-complete-no-antlrv2.jar:./XLexer.class:./XParser.class java XLexer

I also inserted some code to print the string to STDOUT before parsing it, and it printed the expected string xÄy.

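One idea, though: perhaps your default encoding (which, I think, is taken from the file.encoding property at JVM start-up) is set to something other than UTF-8. In that case getBytes() without an argument encodes the string in that default charset, while ANTLRInputStream is told to decode the bytes as UTF-8, so the two no longer agree. To test this, try specifying the encoding explicitly in the call to getBytes():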

ANTLRInputStream stream = new ANTLRInputStream( new ByteArrayInputStream("x\u00C4y".getBytes("UTF-8")), "utf-8");
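
If you want to check the encoding theory without ANTLR in the picture, here is a minimal sketch. The class name EncodingCheck and the helper roundTrip are made up for illustration, and ISO-8859-1 only stands in for "some non-UTF-8 default"; your actual default may differ. It encodes the test string with a given charset and decodes the bytes as UTF-8, which is roughly what happens between getBytes() and ANTLRInputStream. When the charsets disagree, the Ä comes back as the replacement character U+FFFD, which lies outside the '\u0000'..'\uF8FF' range of your UTF8 rule and may be what your console shows as '?'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Hypothetical helper: encode with the given charset, then decode the
    // resulting bytes as UTF-8 (mimicking getBytes() + a UTF-8 input stream).
    static String roundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "x\u00C4y";

        // The charset that getBytes() without arguments would use on this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Matching charsets: the string survives the round trip.
        System.out.println(input.equals(roundTrip(input, StandardCharsets.UTF_8)));

        // Mismatched charsets (assumed example): the single byte 0xC4 is not
        // valid UTF-8 here, so the decoder substitutes U+FFFD for it.
        System.out.println(input.equals(roundTrip(input, StandardCharsets.ISO_8859_1)));
    }
}
```

If the first line printed is not UTF-8 on your machine, that would support the explanation above, and passing "UTF-8" to getBytes() should be enough to fix it.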

Upvotes: 2
