Why isn't ANTLR 4 recognising Unicode characters as valid tokens?

Question

I have been struggling to get ANTLR 4 to recognise Unicode characters in the input.

I reduced my grammar to a simpler test that I found in this answer to a related question. The only thing I changed was the characters it's supposed to recognise, and it still doesn't work.

Grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '↊' | '↋';

Test class:

class UnicodeTest {
    @Test
    fun `parse unicode`() {
        val lexer = UnicodeLexer(CharStreams.fromString("↊↋"))
        val parser = UnicodeParser(CommonTokenStream(lexer))
        val result = parser.stat().text
        println("Result = <$result>")
        assertThat(result).isEqualTo("↊↋")
    }
}

What I get when I run this is:

> Task :test FAILED
line 1:0 token recognition error at: '↊'
line 1:1 token recognition error at: '↋'
Result = <>

expected:<"[↊↋]"> but was:<"[]">
Expected :"[↊↋]"
Actual   :"[]"

From stderr, it looks like it is correctly pulling the characters from my string as Unicode (it did start as a String so it had better!), but then not recognising the characters as a valid token.

I'm not sure how to debug this sort of thing, because the lexer rules get compiled into a giant blob that I can't figure out how to read. What I can verify is that tokens inside the lexer contains only one element, the EOF.

Ruled out so far:

The grammar file itself is UTF-8.

The Java compiler encoding is definitely set to UTF-8.

tasks.withType<JavaCompile> {
    // Why is this not yet the default? :(
    options.encoding = "UTF-8"
}

The Kotlin compiler encoding is supposedly always UTF-8 with no option to change that. Mentioned only because I have no idea which compiler is used to compile the Java classes.

When I run tests, those also run as UTF-8.

tasks.withType<Test> {
    useJUnitPlatform()
    defaultCharacterEncoding = "UTF-8"
}

I get the same issue when running the code in my main program, where I can see that -Dfile.encoding=UTF-8 is on the command line.
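
One thing the checks above do not cover is how the ANTLR tool decodes the .g4 file when it generates the lexer. As a rough illustration only, the Kotlin sketch below shows what would happen to the two literals if a UTF-8 file were read with a single-byte charset. ISO-8859-1 is an assumed stand-in (nothing above establishes which charset the tool actually used), and codePoints is a hypothetical helper written for this example.

```kotlin
// Hypothetical helper: renders each code point of a string as a "U+XXXX" label.
fun codePoints(s: String): List<String> =
    s.codePoints().toArray().map { "U+%04X".format(it) }

fun main() {
    // The two characters from the grammar, written as escapes so that
    // the encoding of this source file cannot affect the result.
    val literals = "\u218A\u218B"

    // What a UTF-8 grammar file would hold on disk for those characters.
    val onDisk = literals.toByteArray(Charsets.UTF_8)

    // Assumption: the reader decodes the bytes as ISO-8859-1 instead of UTF-8.
    val misread = String(onDisk, Charsets.ISO_8859_1)

    println(codePoints(literals)) // [U+218A, U+218B]
    println(codePoints(misread))  // [U+00E2, U+0086, U+008A, U+00E2, U+0086, U+008B]
}
```

If something like this happened at generation time, the lexer would have been built to match six unrelated characters, which would be consistent with the token recognition errors on '↊' and '↋' even though the input string itself is fine.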

Workaround?

If I change the grammar file to use Unicode escapes explicitly, then it works! So OK, there's something about how ANTLR is reading the file, where it isn't defaulting to UTF-8 as many people are saying it does. I plan to use a lot of Unicode though and would prefer not to have to escape everything. So I guess I just have to find some appropriate Gradle config to force the encoding when its compiler runs. :/
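
For reference, the kind of Gradle config I have in mind might look like the sketch below. It assumes the build uses Gradle's built-in antlr plugin with the Kotlin DSL (an assumption, since the grammar could equally be generated some other way) and passes the ANTLR tool's -encoding option through the task's arguments list. I have not confirmed that this resolves the problem.

```kotlin
// build.gradle.kts (sketch; assumes Gradle's built-in `antlr` plugin is applied)
import org.gradle.api.plugins.antlr.AntlrTask

tasks.withType<AntlrTask>().configureEach {
    // Ask the ANTLR tool to read .g4 files as UTF-8 rather than
    // whatever charset it would otherwise pick.
    arguments = arguments + listOf("-encoding", "UTF-8")
}
```

Configuring every AntlrTask rather than a single named task should cover grammars in test source sets as well as main.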
