Yassin Hajaj

Reputation: 22005

UTF-8 does not print characters to the console

I have the following code

import java.util.Arrays;

public class MainDefault {
    public static void main(String[] args) {
        System.out.println("²³");
        // getBytes() without an argument uses the platform default charset
        System.out.println(Arrays.toString("²³".getBytes()));
    }
}

But can't seem to print the special characters to the console

When I do the following, I get the following result

$ javac MainDefault.java
$ java MainDefault

MainDefaultPrinting

On the other hand, when I compile it and run it like this

$ javac -encoding UTF8 MainDefault.java
$ java MainDefault

MainDefaultUTF8CompilationOnly

And when I run it using the file encoding UTF8 flag, I get the following

$ java -Dfile.encoding=UTF8 MainDefault

MainDefaultUTF8CompilationAndRun

It doesn't seem to be a problem with the console (Git Bash on Windows 10), as it prints the characters normally

Echo

Thanks for your help

Upvotes: 9

Views: 11640

Answers (8)

ElpieKay

Reputation: 30966

I ran into the same problem in Git Bash for Windows: java and javac could not print Chinese characters properly. Setting Git Bash's character set to UTF-8 did not help, and neither did chcp. From the Git Bash installation wizard I knew that programs like python do not work properly without winpty, so I had already added alias python='winpty python' to ~/.bashrc. I therefore tried winpty java Foo.java and winpty javac Foo.java, and luckily the problem was gone. I added these aliases to ~/.bashrc to fix the problem:

alias java='winpty java'
alias javac='winpty javac'

Recent versions (v2.2x) of Git Bash for Windows include an experimental winpty-related feature, but it still seems to have some problems, so I have kept these aliases so far.

Upvotes: 1

kriegaex

Reputation: 67477

The hex codes look okay for UTF-8. Maybe your character set for Git Bash is not UTF-8. For me it looks like this:

Text and font settings for mintty (Git Bash)

The console output then also looks fine:

Console output UTF-8


Update 2020-09-13: Here is proof that chcp.com <codepage> does not work in Git Bash (mintty). It has no effect whatsoever. You really do have to select the correct codepage in the mintty settings dialogue.

screen recording of Git Bash mintty


Update 2020-09-15: Okay, after reading @rmunge's answer I upgraded to Git 2.28 and could reproduce the OP's problem. I could also use the chcp workaround (the fix described by @rmunge did not work in my case). Because the latest versions of Git (or rather MSYS2) are so buggy, and I don't want to run chcp.com inside Git Bash every time I open a new console, I simply downgraded to version 2.15.1, which I had used for 3 years without any problems. There may be later versions without the console bug; I did not try them and just used my old installer from the downloads folder on my computer. I recommend everyone do the same to sidestep this ugly bug. With a non-buggy console version, it just works as I described.

Upvotes: 4

jccampanero

Reputation: 53461

Your code is not printing the right characters in the console because your Java program and the console are using different character sets, that is, different encodings.

If you want to obtain the same characters, you first need to determine which character sets are in place.

This process will depend on the "console" in which you are outputting your results.

If you are working with Windows and cmd, as @RickJames suggested, you can use the chcp command to determine the active code page.

Oracle documents the full list of encodings supported by Java, along with their aliases (code pages, in this case), in this page.

This Stack Overflow answer also provides some guidance about the mapping between Windows code pages and Java charsets.

As you can see in the provided links, the code page for UTF-8 is 65001.

If you are using Git Bash (MinTTY), you can follow @kriegaex's instructions to verify or configure UTF-8 as the terminal emulator encoding.

Linux and UNIX, or UNIX derived systems like Mac OS, do not use code page identifiers, but locales. The locale information can vary between systems, but you can either use the locale command or try to inspect the LC_* system variables to find the required information.

This is the output of the locale command in my system:

LANG="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_CTYPE="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_ALL=

Once you know this information, you need to run your Java program with the file.encoding VM option corresponding to the right charset:

java -Dfile.encoding=UTF8 MainDefault

Some classes, like PrintStream or PrintWriter, allow you to specify the Charset in which the information will be output.
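As a minimal sketch of that last point, the following wraps an output stream in a PrintStream that always encodes text as UTF-8, independent of file.encoding. The class name Utf8Stdout and the helper utf8Stream are hypothetical names chosen for illustration. The string is written with Unicode escapes so that the result does not depend on how javac reads the source file. Whether the characters then display correctly still depends on the console's own character set.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Stdout {

    /** Hypothetical helper: a PrintStream that encodes all text as UTF-8. */
    static PrintStream utf8Stream(OutputStream target) {
        try {
            // autoFlush = true, explicit encoding name instead of the default charset
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // "²³" written as escapes: U+00B2 and U+00B3
        out.println("\u00B2\u00B3");
    }
}
```

The bytes written to the console are C2 B2 C2 B3 followed by the line separator, so the terminal has to be configured for UTF-8 for them to render as ²³.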

The -encoding javac option only allows you to specify the character encoding used by source files.

If you are using Windows with Git Bash, consider also reading @rmunge's answer: it describes a possible bug in the tool that may be the cause of the problem and that prevents the terminal from working correctly out of the box, without manual encoding adjustments.

Upvotes: 13

rmunge

Reputation: 4248

The short version:

The unexpected behavior is reproducible with the following setup:

  • Windows 10 with English, German or French language, or any other language that leads to ANSI and OEM codepages that encode ² and ³ differently

  • Git for Windows 2.27.0 (installed with default setting i.e. configured to use MinTTY and experimental support for pseudo consoles disabled)

  • Source code is stored in UTF-8 encoding

To get correct behavior:

  • Either re-install Git for Windows 2.27.0 and enable experimental support for pseudo consoles on the last page of the installer, or upgrade to the latest 2.28 version

  • Compile your code with javac -encoding UTF8

  • Call java without overriding file.encoding

The medium version:

Git for Windows 2.27.0 uses a version of MSYS2 that does not set the code page for MinTTY by calling SetConsoleCP when support for pseudo consoles is disabled. The Java runtime determines the codepage for System.out by calling GetConsoleCP. Since no codepage is set when Java is executed within a MinTTY terminal, the call fails and Java uses the charset returned by Charset.defaultCharset() as a fallback. But in a Windows installation as described above, Charset.defaultCharset() returns Cp-1252 while the default charset for consoles is Cp-850. The two codepages are not fully compatible. This leads to the strange output.

The long version:

Windows has two types of codepages: ANSI and OEM codepages. The first type is intended for UI applications that do not support Unicode, and the latter is used for console applications. Both types encode a single character in 1 byte, but they are not fully compatible.

Therefore on Windows Java has to deal with two charsets instead of one:

  • Charset.defaultCharset() returns the ANSI codepage (usually cp-1252). This charset is specified by the file.encoding system property. If not specified as VM argument, the java executable determines the ANSI codepage and adds the system property during initialization. String.getBytes() uses the charset returned by Charset.defaultCharset().
  • System.out uses the OEM codepage for consoles (usually cp-850). The java executable gets this codepage by calling the GetConsoleCP function and sets it as the value of the internal system properties sun.stdout.encoding and sun.stderr.encoding. When the call to GetConsoleCP fails, the charset returned by Charset.defaultCharset() is used. This only happens when the console in which java.exe is executed has not set the OEM codepage beforehand by calling SetConsoleCP.
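To see which charsets the JVM actually picked in a given console, the values from the two bullets above can be printed directly. This is only a diagnostic sketch: the class name EncodingReport and the method collect are hypothetical, and sun.stdout.encoding and sun.stderr.encoding are internal properties that may be unset (shown as "null") depending on the JDK version and on whether a console was detected.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingReport {

    /** Collects the charset-related settings discussed above, in a stable order. */
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        // Charset used by String.getBytes() when no charset is passed
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        String[] keys = {"file.encoding", "sun.stdout.encoding", "sun.stderr.encoding"};
        for (String key : keys) {
            // String.valueOf turns a missing property into the text "null"
            report.put(key, String.valueOf(System.getProperty(key)));
        }
        return report;
    }

    public static void main(String[] args) {
        // Only ASCII is printed, so the report itself is readable in any console
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running it once from cmd and once from Git Bash should make any difference between the two environments visible.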

So what happens now in the setup mentioned above?

$ javac MainDefault.java
$ java MainDefault

enter image description here

The native call of GetConsoleCP fails due to the bug in MSYS2. Therefore System.out falls back to the charset returned by Charset.defaultCharset() which is cp-1252. But the OEM codepage of the console is cp-850. Therefore System.out.println("²³") produces unexpected output.

The source code is stored in UTF-8. Encoding "²³" in UTF-8 requires 4 bytes. But because the -encoding parameter is missing, javac assumes the default encoding, which uses one byte per character. It therefore interprets the 4 bytes as 4 characters. String.getBytes uses the 1-byte-based ANSI code page cp-1252 and therefore returns 4 bytes.

$ javac -encoding UTF8 MainDefault.java
$ java MainDefault

enter image description here

With the -encoding UTF8 parameter, javac interprets the UTF-8 encoded source as UTF-8. So the 4 bytes of "²³" are correctly recognized as two characters. System.out encodes the two characters in cp-1252, which leads to 2 bytes. But since the console still uses cp-850, the output is still corrupted. String.getBytes also encodes the two characters in cp-1252, which leads to 2 bytes.

$ java -Dfile.encoding=UTF8 MainDefault

enter image description here

The system property file.encoding overrides the charset returned by Charset.defaultCharset(), which is also used by String.getBytes(). The two characters, which javac first wrongly interpreted as 4 characters in an 8-bit encoding, are now correctly encoded in UTF-8 as two characters with two bytes per character. This leads to 4 bytes. Since file.encoding has no effect on the charset used by System.out, the 4 (and not 2, due to the wrong interpretation by javac) characters are still encoded in cp-1252, the console still uses cp-850, and you still get corrupted output.

enter image description here

Your console can print ²³ since the console's 8-bit OEM code page (cp-850) supports both characters. But it encodes them slightly differently than the ANSI code page cp-1252 that is used by System.out ;-)
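The byte counts in the walkthrough above can be checked with a small sketch. It assumes that the JDK in use provides the charsets windows-1252 and IBM850 (Java's names for cp-1252 and cp-850). The class name CodepageDemo and the helper hex are hypothetical. The string is written with Unicode escapes so that the result does not depend on the -encoding option, and only ASCII is printed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodepageDemo {
    // Assumption: both charsets are available in the running JDK
    static final Charset CP1252 = Charset.forName("windows-1252");
    static final Charset CP850 = Charset.forName("IBM850");

    /** Formats bytes as unsigned hex separated by spaces, e.g. "C2 B2". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00B2\u00B3"; // "²³"

        // What the UTF-8 source file contains for the literal: 4 bytes
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8:  " + hex(utf8)); // C2 B2 C2 B3

        // javac without -encoding on a cp-1252 system: 4 bytes become 4 characters
        String misread = new String(utf8, CP1252);
        System.out.println("chars when read as cp-1252: " + misread.length()); // 4

        // The same two characters in the ANSI and the OEM code page
        System.out.println("cp-1252: " + hex(text.getBytes(CP1252))); // B2 B3
        System.out.println("cp-850:  " + hex(text.getBytes(CP850)));  // FD FC
    }
}
```

The last two lines show the mismatch: System.out sends B2 B3, while a cp-850 console expects FD FC for the same characters.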

Upvotes: 5

rmunge

Reputation: 4248

Please verify that your Windows 10 installation does not have Unicode UTF-8 support enabled. You can see this option by going to Settings and then: All Settings -> Time & Language -> Language -> "Administrative Language Settings"

This is what it looks like - the feature should be unchecked.

enter image description here

Rationale:

"²³".getBytes() returns the encoding of the string, based on the detected default charset. On a Windows 10 system the default charset should usually be a 1-Byte based encoding, independent from whether you launch java.exe from a Windows console or from Git Bash. But your first screenshot shows a 4-Byte encoding that is actually UTF-8. So your JVM seems to detect UTF-8 as the wrong default charset that is incompatible with the codepage of your console.

Your console can print ²³ because both characters are supported by the used code page, but the encoding is based on one byte per character while UTF-8 encoding requires 2 Bytes for each of these two characters.

I have no simple explanation for your second screenshot, but be aware that Git Bash is based on MSYS2, which in turn uses the mintty terminal emulator. While MSYS2 uses UTF-8, and mintty also seems to support UTF-8, the whole thing is wrapped in a Windows console that is based on an OEM codepage that is incompatible with UTF-8. All of this runs on an operating system that internally uses UTF-16. Combined with a beta setting that overrules the whole OEM codepage concept at the OS level, this setup provides enough complexity for some incomprehensible behavior.

Upvotes: 0

vvg

Reputation: 1213

On Windows, it has to do with your code page. You can use the chcp command to set the code page you want (e.g. if you want to set it for a specific program you launch), or you can specify the charset corresponding to the code page on the java command line.

If the current codepage does not support the characters you are printing, you will see garbage in the console.
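Whether a given charset can represent the characters at all can be checked from Java with CharsetEncoder.canEncode. The sketch below is only an illustration: the class name CanEncodeCheck and the method supports are hypothetical, and the IBM437 and IBM850 names assume that the JDK in use ships those charsets.

```java
import java.nio.charset.Charset;

public class CanEncodeCheck {

    /** Returns true if every character of text can be encoded in the named charset. */
    static boolean supports(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "\u00B2\u00B3"; // "²³" written as escapes
        String[] names = {"US-ASCII", "IBM437", "IBM850", "windows-1252", "UTF-8"};
        for (String name : names) {
            System.out.println(name + " can encode both characters: " + supports(name, text));
        }
    }
}
```

A result of true only means that the charset has a byte sequence for the characters. The output is still garbled if the console decodes those bytes with a different code page.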

The reason why different shells may behave differently is due to the codepage/charsets that are loaded by default.

Please check out this SO post for how it is done: System.out character encoding

Upvotes: 1

Tharindu Sathischandra

Reputation: 2004

I am also using Git Bash on Windows 10 and it works totally fine for me.

Here's how it prints,

Trying to reproduce it in Git Bash on Windows 10

The terminal version is mintty 3.0.2 (x86_64-pc-msys) and my text properties were:

enter image description here

So, I tried to reproduce your outputs by changing Character Sets;

enter image description here

By setting Character Set to CP437 (OEM codepage) (note that this automatically changed Locale to C as well), I was able to get the same output as you.

enter image description here

Then, when I changed it back to UTF-8 (Unicode), I got the expected output!

enter image description here

Therefore, it is clear that the problem is with your console's Character Set.

Upvotes: 5

Rick James

Reputation: 142540

Hex C2B2 C2B3, when interpreted as UTF-8, is ²³.

I assume you are using a Windows "cmd terminal"?

The command "chcp" controls the "code page". chcp 65001 provides utf8, but it needs a special charset installed, too. To set the font in the console window: Right-click on the title of the window → Properties → Font → pick Lucida Console

Upvotes: 0
