Reputation: 19490
I'd like to know how to make my code produce the same output (UTF-8 or UTF-16) on different platforms (at least Windows and Linux).
I thought it was possible to set the codepage an application uses, but I can't find out how to do that. I also don't know whether setting a codepage would really produce the same output for special characters like äöü or other non-Latin characters.
I'd like to have a solution that works without setting arguments for java.exe.
Edit:
I mean the output to a console. A comment about possible effects on other output media would be nice.
Upvotes: 2
Views: 1733
Reputation: 108949
The Java char type uses UTF-16, which can represent every code point in the Unicode character set (code points outside the Basic Multilingual Plane take two char values). Pretty much all I/O where strings are used involves some implicit transcoding operation.
To save and restore character data without loss or corruption, it is generally best to use one of the Unicode transformation formats. There are reader and writer types that can be used to perform this transcoding operation. Avoid the constructors that take no charset argument, as they rely on the platform default encoding, which can be a legacy encoding best consigned to decades past. Explicitly specifying UTF-8 is generally preferred.
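As a minimal sketch of the point about explicit encodings, the following round-trips a string through UTF-8 bytes using OutputStreamWriter and InputStreamReader. The class and method names (Utf8RoundTrip, save, restore) are made up for illustration, and in-memory streams stand in for a file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Hypothetical helper: encodes text as UTF-8 using a writer with an explicit charset.
    static byte[] save(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Hypothetical helper: decodes UTF-8 bytes using a reader with the same explicit charset.
    static String restore(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file's encoding.
        String original = "\u00E4\u00F6\u00FC \u20AC";
        byte[] stored = save(original);
        System.out.println(stored.length);                     // prints 10
        System.out.println(original.equals(restore(stored)));  // prints true
    }
}
```

Because both sides name the charset, the result does not depend on the platform default encoding.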
There are different issues with writing to the terminal. Here you are writing data that will be decoded by another application so you must write character data in a format it understands.
The Console type will detect and use the terminal's encoding, whereas System.out uses the default platform encoding; these are different on Windows for a bunch of historical reasons. The other differences are noted here. The documented way to use Unicode in cmd.exe is to use the appropriate Win32 API calls.
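One way to act on the Console versus System.out distinction is sketched below, assuming you simply want to prefer the console's writer when one is available. The class and method names (TerminalWriter, terminalWriter) are hypothetical. System.console() can return null when the JVM has no interactive console, for example when output is redirected, so the sketch falls back to a writer over System.out with the default charset.

```java
import java.io.Console;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

public class TerminalWriter {
    // Hypothetical helper: prefers the Console writer, falls back to System.out.
    static PrintWriter terminalWriter() {
        Console console = System.console();
        if (console != null) {
            // The console's writer encodes text for the attached terminal.
            return console.writer();
        }
        // No console attached: wrap System.out using the default charset.
        return new PrintWriter(
                new OutputStreamWriter(System.out, Charset.defaultCharset()), true);
    }

    public static void main(String[] args) {
        PrintWriter out = terminalWriter();
        out.println("Gr\u00FC\u00DFe"); // non-ASCII text, written via Unicode escapes
        out.flush();                    // flush, but do not close the shared stream
    }
}
```

Whether the characters actually display correctly still depends on the terminal and its font, which this sketch cannot control.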
Some relevant posts from my blog:
BalusC also has a good post on some of the practical issues of character handling: Unicode - How to get the characters right?
Upvotes: 1
Reputation: 70574
A charset (or codepage, as it used to be called) converts a sequence of characters into a sequence of bytes.
In the Java API, charsets are implemented as subclasses of Charset. All API elements that convert between characters and bytes can be provided with the charset to use (many also allow you to pass the charset name instead, so you don't have to do the lookup yourself). If you do not provide a charset, those methods usually fall back to the operating system's default encoding.
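To illustrate why the choice of charset matters, here is a small sketch that encodes the same character with different Charset instances. The class name CharsetDemo and the helper encode are made up for the example; String.getBytes(Charset), Charset.forName and StandardCharsets are standard API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    // Hypothetical helper: encodes text with an explicitly chosen charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String text = "\u00E4"; // the letter ä, written as a Unicode escape

        // The same character becomes different byte sequences per charset.
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_8)));      // prints [-61, -92]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.ISO_8859_1))); // prints [-28]
        System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_16BE)));   // prints [0, -28]

        // Looking a charset up by name yields an equal Charset object.
        System.out.println(Charset.forName("UTF-8").equals(StandardCharsets.UTF_8));    // prints true
    }
}
```

A reader that assumes a different charset than the writer used will therefore misinterpret these bytes, which is the usual cause of garbled umlauts.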
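For instance, OutputStreamWriter features a constructor that takes a charset:
    import java.io.*;
    import java.nio.charset.StandardCharsets;
    // Flush instead of closing: try-with-resources would also close System.out.
    Writer w = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
    w.write("Hello world");
    w.flush();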
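Since the question asks for a solution that needs no arguments to java.exe, one possible approach (a sketch, not the only option) is to replace System.out at startup with a PrintStream that always encodes as UTF-8. The class and method names (Utf8Stdout, utf8Stream) are hypothetical; the PrintStream(OutputStream, boolean, String) constructor and System.setOut are standard API.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Stdout {
    // Hypothetical helper: wraps any stream in an auto-flushing UTF-8 PrintStream.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // From here on, System.out emits UTF-8 bytes on every platform.
        System.setOut(utf8Stream(System.out));
        System.out.println("\u00E4\u00F6\u00FC");
    }
}
```

This makes the bytes written identical on Windows and Linux. Whether a terminal shows them correctly still depends on the encoding that terminal expects, for example the active code page in cmd.exe.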
Upvotes: 1