kujawk

Reputation: 857

Java: Converting UTF 8 to String

When I run the following program:

public static void main(String args[]) throws Exception
{
    byte str[] = {(byte)0xEC, (byte)0x96, (byte)0xB4};
    String s = new String(str, "UTF-8");
}

on Linux and inspect the value of s in jdb, I correctly get:

s = "어"

on Windows, I incorrectly get:

s = "?"

My byte sequence is the valid UTF-8 encoding of a Korean character. Why would it produce two very different results?

Upvotes: 2

Views: 4463

Answers (4)

Dan Bliss

Reputation: 1744

JDB is displaying the data incorrectly. The code works the same on both Windows and Linux. Try running this more definitive test:

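import java.math.BigInteger;

public static void main(String[] args) throws Exception {
    byte[] str = {(byte)0xEC, (byte)0x96, (byte)0xB4};
    String s = new String(str, "UTF-8");
    // Print each UTF-16 code unit in hex, so the result does not depend on the console encoding.
    for (int i = 0; i < s.length(); i++) {
        System.out.println(BigInteger.valueOf((int)s.charAt(i)).toString(16));
    }
}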

This prints the hex value of every character in the string. It prints "c5b4" on both Windows and Linux, which shows that the bytes were decoded correctly.
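A variation on the same idea, as a minimal sketch: it reports the decoded character as an ASCII-only code point label and checks that re-encoding gives back the original bytes, so nothing depends on how the console renders Korean text. The class name `Utf8RoundTrip` and the helpers `decode` and `describe` are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    // Hypothetical helper: decode raw bytes as UTF-8.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: describe each UTF-16 code unit using ASCII only,
    // for example "U+C5B4", so the console encoding cannot distort the output.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        String s = decode(raw);
        System.out.println(describe(s));
        // Re-encode and compare with the input bytes to confirm a lossless round trip.
        System.out.println(Arrays.equals(raw, s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the second line prints `true`, the string inside the JVM is intact and any "?" seen elsewhere comes from the display layer.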

Upvotes: 0

Tomasz Nurkiewicz

Reputation: 340933

It correctly prints "어" on my computer (Ubuntu Linux), as described in Code Table Korean Hangul. The Windows command prompt is known to have issues with encoding, so don't worry about it.

Your code is fine.
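To see how a "?" can appear even though the string is correct, here is a rough sketch. It pushes the text through a charset that cannot represent the character, which is approximately what happens when output goes to a console with a limited encoding. US-ASCII is only a stand-in here; which encoding jdb or the Windows console actually uses on a given machine is an assumption. The class `LossyEncodingDemo` and the helper `throughCharset` are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {
    // Hypothetical helper: encode text into the given charset and decode it back,
    // roughly simulating text passing through a console that uses that charset.
    static String throughCharset(String text, Charset consoleCharset) {
        // String.getBytes(Charset) substitutes the charset's replacement byte
        // for any character the charset cannot represent.
        byte[] encoded = text.getBytes(consoleCharset);
        return new String(encoded, consoleCharset);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        String s = new String(raw, StandardCharsets.UTF_8);

        // US-ASCII has no mapping for U+C5B4, so the character is replaced.
        System.out.println(throughCharset(s, StandardCharsets.US_ASCII));

        // UTF-8 can represent it, so the round trip preserves the string.
        System.out.println(throughCharset(s, StandardCharsets.UTF_8).equals(s));
    }
}
```

The string itself never changes in either case; only the lossy encoding step on the way out replaces the character.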

Upvotes: 3

Sergey Kalinichenko

Reputation: 726987

You get the correct string; it is the Windows console that does not display it correctly.

Here is a link to an article that discusses a way to make Java console produce correct Unicode output using JNI.

Upvotes: 1

Bozho

Reputation: 597342

It gives "어" for me. This means your console is probably not configured to display UTF-8, so it is a printing/display problem rather than a problem with the string's representation.
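If the goal is to have the program write UTF-8 bytes regardless of the platform default encoding, one option is to wrap `System.out` in a `PrintStream` with an explicit encoding. This is only a sketch: the class `Utf8Console` and the helper `utf8Stream` are hypothetical names, and whether the glyph then appears still depends on the terminal's own code page and font (on the Windows command prompt, `chcp 65001` selects the UTF-8 code page).

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8Console {
    // Hypothetical helper: wrap a stream so that printed text is encoded as UTF-8
    // instead of the platform default encoding.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        String s = new String(raw, StandardCharsets.UTF_8);

        PrintStream out = utf8Stream(System.out);
        out.println(s); // writes the bytes EC 96 B4 followed by a line separator
    }
}
```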

Upvotes: 1
