Ádám

Reputation: 11

Java/terminal issues when reading and printing UTF-8 encoded strings

I wanted to build a chatroom that supports UTF-8, but it failed for certain UTF-8 characters. After a week of frustration I narrowed the problem down to my user input handling. I asked ChatGPT and read countless forum threads, but I couldn't figure it out.

I'm on Windows and use an up-to-date VS Code. Its terminal uses UTF-8: chcp returns 65001, and the same goes for cmd, so I don't think the terminal's code page is the problem. I also tried forcing System.out to UTF-8, which didn't fix it:

System.setOut(new PrintStream(System.out, true, StandardCharsets.UTF_8));

I don't have a problem when I hard-code a String containing non-ASCII characters and print it, for example:

String random = "háló";
System.out.println(random);

returns: háló

I've tried Scanner, BufferedReader, InputStreamReader, and converting the input to bytes:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(System.in, StandardCharsets.UTF_8)
            );
            System.out.println("Enter some text (UTF-8 characters supported):");
            String userInput = reader.readLine();
            
            // Print the user input to verify
            System.out.println("You entered: " + userInput);
            // Show the UTF-8 bytes of the string that was actually read
            byte[] bytes = userInput.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(bytes));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

returns:

Enter some text (UTF-8 characters supported):
háló
You entered: hl
[104, 0, 108, 0]
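
The output above shows a 0 where "á" and "ó" should be, but it does not show whether the terminal already delivered those bytes or whether they appeared during decoding. One possible way to check is to dump the raw bytes from System.in before any Reader touches them. This is only a sketch: RawStdinDump and toHex are hypothetical names, not part of the chatroom code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

class RawStdinDump {
    // Format each byte as two upper-case hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Type a line and press Enter:");
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        // Read raw bytes up to the first '\n' (or end of input), with no charset decoding.
        while ((b = System.in.read()) != -1 && b != '\n') {
            line.write(b);
        }
        System.out.println(toHex(line.toByteArray()));
    }
}
```

If the terminal really sends UTF-8, typing háló should give 68 C3 A1 6C C3 B3 (on Windows probably followed by 0D for the carriage return). Seeing 00 in place of C3 A1 would suggest the bytes are already wrong before Java decodes anything.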

Note: I'm a beginner Java developer, coming from Python.

EDIT: New findings, plus things I previously left out. I was able to run a Python chatroom before that handled UTF-8 characters and printed them correctly. As mentioned, in Java I can print them too if I hard-code them in a variable, but not when they come from user input.

I downloaded IntelliJ and tried more terminals, and found that it works correctly in IntelliJ and in bash (under Windows Subsystem for Linux):

IntelliJ:

Enter some text (UTF-8 characters supported):
háló
You entered: háló
[104, -61, -95, 108, -61, -77]

bash:

root@DESKTOP:/mnt/x/javaProjects/UTF 8 SUFFERING# java Main
Enter some text (UTF-8 characters supported):
háló
You entered: háló
[104, -61, -95, 108, -61, -77]
root@DESKTOP:/mnt/x/javaProjects/UTF 8 SUFFERING# java -version
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment (build 17.0.11+9-Ubuntu-122.04.1)
OpenJDK 64-Bit Server VM (build 17.0.11+9-Ubuntu-122.04.1, mixed mode, sharing)

Interestingly, these print a different byte array. I then tried cmd, PowerShell, Git Bash, and the VS Code terminal, and none of them work. I also tried different fonts, as suggested by SedJ601. By default I use Consolas in cmd, which has glyphs for these characters (so characters like "á" are displayed correctly when I type them), but when Java gets them from user input it doesn't work:

Enter some text (UTF-8 characters supported):
háló
You entered: h l
[104, 0, 108, 0]

But when I tried a different font, for example SimSun-ExtB, I got a different result and a different byte array:

Enter some text (UTF-8 characters supported):
háló
You entered: h�l�
[104, -17, -65, -67, 108, -17, -65, -67]
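
For reference, the three byte arrays shown so far can be decoded back into code points with a small sketch like the one below. ByteArrayCheck and codePoints are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

class ByteArrayCheck {
    // Decode UTF-8 bytes and list the resulting code points, e.g. "U+0068 U+00E1".
    static String codePoints(byte[] utf8) {
        String s = new String(utf8, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] working = {104, -61, -95, 108, -61, -77};            // IntelliJ / WSL bash
        byte[] consolas = {104, 0, 108, 0};                         // cmd with Consolas
        byte[] simSun = {104, -17, -65, -67, 108, -17, -65, -67};   // cmd with SimSun-ExtB

        System.out.println(codePoints(working));  // U+0068 U+00E1 U+006C U+00F3
        System.out.println(codePoints(consolas)); // U+0068 U+0000 U+006C U+0000
        System.out.println(codePoints(simSun));   // U+0068 U+FFFD U+006C U+FFFD
    }
}
```

So the working runs contain the real "á" (U+00E1) and "ó" (U+00F3), the Consolas run contains NUL characters in their place, and the SimSun-ExtB run contains U+FFFD, the Unicode replacement character. In both failing cases the original characters are already gone by the time the string is printed.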

So something is wrong with how Java and my terminal interact.

I have a recent version of Java:

X:\javaProjects\UTF 8 SUFFERING>java -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
openjdk version "17.0.11" 2024-04-16
OpenJDK Runtime Environment Temurin-17.0.11+9 (build 17.0.11+9)
OpenJDK 64-Bit Server VM Temurin-17.0.11+9 (build 17.0.11+9, mixed mode, sharing)

I tried setting environment variables, which didn't fix anything:

set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8
set JAVA_OPTS=-Dfile.encoding=UTF-8

Java's file.encoding is set to UTF-8, so I'm even more confused:

X:\javaProjects\UTF 8 SUFFERING>java -XshowSettings:properties -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Property settings:
    file.encoding = UTF-8
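
file.encoding is only one of several encoding-related settings the JVM exposes, so it may be worth printing the others as well. The sketch below assumes Java 17 or newer (Console.charset() and the native.encoding property were added in 17). The sun.* properties are internal and may print null, and EncodingSettings and describe are hypothetical names. Whether any of these values explains the behaviour here is an open question.

```java
import java.io.Console;
import java.nio.charset.Charset;

class EncodingSettings {
    // Return "key = value" for a system property; the value is "null" if it is unset.
    static String describe(String key) {
        return key + " = " + System.getProperty(key);
    }

    public static void main(String[] args) {
        String[] keys = {"file.encoding", "native.encoding", "sun.jnu.encoding", "sun.stdout.encoding"};
        for (String key : keys) {
            System.out.println(describe(key));
        }
        System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset());

        // System.console() is null when stdin/stdout are redirected or no console is attached.
        Console console = System.console();
        System.out.println("System.console().charset() = "
                + (console == null ? "(no console attached)" : console.charset()));
    }
}
```

Running this in cmd and again in IntelliJ or WSL bash, then comparing the two outputs line by line, would show whether the JVM sees the environments differently.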

Maybe Java and my terminals use a different lookup table for special characters; I have no idea. I even changed the system locale to my country's, and that didn't help either. I'm clueless, so any help would be appreciated!

Upvotes: 1

Views: 96

Answers (0)
