Reputation: 12728
I have been struggling to get ANTLR 4 to recognise Unicode characters in the input.
I reduced my grammar to a simpler test I found in an answer to a related question. All I changed is the characters it's supposed to recognise, and it still doesn't work.
Grammar:
grammar Unicode;
stat: E_CODE* EOF;
E_CODE: '↊' | '↋';
Test class:
class UnicodeTest {
    @Test
    fun `parse unicode`() {
        val lexer = UnicodeLexer(CharStreams.fromString("↊↋"))
        val parser = UnicodeParser(CommonTokenStream(lexer))
        val result = parser.stat().text
        println("Result = <$result>")
        assertThat(result).isEqualTo("↊↋<EOF>")
    }
}
What I get when I run this is:
> Task :test FAILED
line 1:0 token recognition error at: '↊'
line 1:1 token recognition error at: '↋'
Result = <<EOF>>
expected:<"[↊↋]<EOF>"> but was:<"[]<EOF>">
Expected :"[↊↋]<EOF>"
Actual :"[]<EOF>"
From stderr, it looks like it is correctly pulling the characters from my string as Unicode (it did start as a String, so it had better!), but then not recognising the characters as a valid token.
I'm not sure how to debug this sort of thing, because the lexer rules get compiled into a giant blob that I can't figure out how to read. What I can verify is that tokens inside the lexer contains only one element, the EOF.
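One way to narrow this down without reading the generated lexer is to print the code points of the string that is handed to the lexer. The following is only a minimal sketch: the class name CodePointDump and the helper describe are hypothetical, and the characters are written as Unicode escapes so that the encoding of the source file cannot affect the result. If the code points come out as expected, the input side is fine, and the remaining suspect is how the grammar file itself was decoded when the lexer was generated.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    // Render every code point of the text as U+XXXX, separated by spaces.
    static String describe(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The two characters from the E_CODE rule, written as escapes.
        System.out.println(describe("\u218A\u218B")); // prints U+218A U+218B
    }
}
```

If the test string prints anything other than U+218A U+218B, the characters were already damaged before they reached ANTLR.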
Ruled out so far:
tasks.withType<JavaCompile> {
    // Why is this not yet the default? :(
    options.encoding = "UTF-8"
}
tasks.withType<Test> {
    useJUnitPlatform()
    defaultCharacterEncoding = "UTF-8"
}
-Dfile.encoding=UTF-8 is on the command line.
Workaround?
Upvotes: 1
Views: 676
Reputation: 170308
How the source files are compiled is (AFAIK) not important.
Using your example grammar as-is, I ran the following tests:
InputStream inputStream = new ByteArrayInputStream("↊↋".getBytes(StandardCharsets.UTF_8));
UnicodeLexer lexer = new UnicodeLexer(CharStreams.fromStream(inputStream, StandardCharsets.UTF_8));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());
and:
UnicodeLexer lexer = new UnicodeLexer(CharStreams.fromFileName("input.txt", StandardCharsets.UTF_8));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());
(where the file input.txt contains ↊↋)
and both resulted in the following being printed to my console:
↊↋<EOF>
In other words: did you try specifying the encoding when creating the CharStream?
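To illustrate why the charset matters on both the encoding and the decoding side, here is a small self-contained sketch that does not involve ANTLR at all. The class name CharsetRoundTrip and the helper roundTrip are made up for this example. It encodes the two characters to bytes with a given charset and then decodes those bytes as UTF-8, which is what a UTF-8 CharStream would do with them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Encode the text with the given charset, then decode the bytes as UTF-8.
    static String roundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "\u218A\u218B"; // the two characters from the grammar

        // Matching charsets: the characters survive the round trip.
        System.out.println(roundTrip(input, StandardCharsets.UTF_8).equals(input)); // prints true

        // ISO-8859-1 cannot represent these characters, so each becomes '?'.
        System.out.println(roundTrip(input, StandardCharsets.ISO_8859_1).equals(input)); // prints false
    }
}
```

Calling getBytes() with no argument uses the platform default charset, so passing the charset explicitly on both sides keeps the result independent of the machine the code runs on.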
Upvotes: 1