Hakanai
Hakanai

Reputation: 12728

Why isn't ANTLR 4 recognising Unicode characters as valid tokens?

I have been struggling to get ANTLR 4 to recognise Unicode characters in the input.

I reduced my grammar to a simpler test I found on this answer to a related question, but all I've done is change the characters it's supposed to recognise, and it didn't work.

Grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '↊' | '↋';

Test class:

class UnicodeTest {
    @Test
    fun `parse unicode`() {
        val lexer = UnicodeLexer(CharStreams.fromString("↊↋"))
        val parser = UnicodeParser(CommonTokenStream(lexer))
        val result = parser.stat().text
        println("Result = <$result>")
        assertThat(result).isEqualTo("↊↋<EOF>")
    }
}

What I get when I run this is:

> Task :test FAILED
line 1:0 token recognition error at: '↊'
line 1:1 token recognition error at: '↋'
Result = <<EOF>>

expected:<"[↊↋]<EOF>"> but was:<"[]<EOF>">
Expected :"[↊↋]<EOF>"
Actual   :"[]<EOF>"

From stderr, it looks like it is correctly pulling the characters from my string as Unicode (it did start as a String so it had better!), but then not recognising the characters as a valid token.

I'm not sure how to debug this sort of thing, because the lexer rules get compiled into a giant blob that I can't figure out how to read. What I can verify is that tokens inside the lexer only contains one element, the EOF.

Ruled out so far:

Workaround?

Upvotes: 1

Views: 676

Answers (1)

Bart Kiers
Bart Kiers

Reputation: 170308

How source files are compiled are (AFAIK) not important.

Using your example grammar as-is, I ran the following tests:

InputStream inputStream = new ByteArrayInputStream("↊↋".getBytes());
UnicodeLexer lexer = new UnicodeLexer(CharStreams.fromStream(inputStream, StandardCharsets.UTF_8));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and:

UnicodeLexer lexer = new UnicodeLexer(CharStreams.fromFileName("input.txt", StandardCharsets.UTF_8));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

(where the file input.txt contains ↊↋)

and both resulted in the following being printed to my console:

↊↋<EOF>

I.e. did you try adding the encoding to the CharStream?

Upvotes: 1

Related Questions