Philippe

Reputation: 446

UTF-8 difference between Oracle and Java

I have the following Unicode difference between an Oracle database and Java.

If I run the following in Oracle SQL Developer:

select unistr('\008C') from dual;

I get the following Unicode character: http://www.utf8icons.com/character/140/control-character

However, if I try to perform the same kind of Unicode code point to string conversion in Java:

String s1 = new String("\u008C".getBytes(), "UTF-8");

I get an empty char as a result.

I understand I could use the \u0152 character, which displays the glyph I need properly in both Java and Oracle, but I would like to understand why this difference exists. I tried playing with my fonts but did not get any decent result. Thanks.
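For reference, a quick check shows that both notations name the same code point, and that the Java string is not actually empty:

    String s1 = "\u008C";
    System.out.println(s1.length());        // prints 1 -- the string is not empty
    System.out.println((int) s1.charAt(0)); // prints 140 (0x8C), same code point as unistr('\008C')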

Upvotes: 3

Views: 1855

Answers (2)

Alastair McCormack

Reputation: 27704

This makes no sense:

String s1 = new String("\u008C".getBytes(), "UTF-8");

If you're lucky, your default encoding will be UTF-8 and you'll get:

s1.equals("\u008C") == true

This is because .getBytes() defaults to your system encoding. You're effectively encoding to an unknown (but discoverable) encoding and decoding from UTF-8.

If you're unlucky, your default encoding will be something else and you'll have mojibaked your string.
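A minimal sketch of that failure mode (assuming the windows-1252 charset is available in your JDK, which it normally is):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    String original = "\u0152"; // LATIN CAPITAL LIGATURE OE, a printable stand-in

    // Encoding and decoding with the SAME charset round-trips cleanly:
    byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
    System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(original)); // true

    // Encoding with one charset and decoding with another mojibakes the string:
    byte[] cp1252 = original.getBytes(Charset.forName("windows-1252")); // one byte, 0x8C
    System.out.println(new String(cp1252, StandardCharsets.UTF_8));     // U+FFFD replacement character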

If what you meant to say was:

 System.out.println( "\u008C" );

produces nothing, it's because U+008C 'PARTIAL LINE BACKWARD' is a control character, i.e. it's non-printing. It should never be printed. It would seem that some UIs automatically render this character as 'LATIN CAPITAL LIGATURE OE' (U+0152), but that behaviour is implementation-dependent.
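You can verify the control-character claim in Java itself with java.lang.Character:

    char c = '\u008C'; // PARTIAL LINE BACKWARD, in the C1 control range U+0080..U+009F
    System.out.println(Character.isISOControl(c));                 // true
    System.out.println(Character.getType(c) == Character.CONTROL); // true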

For example, if I create an HTML document with Œ in it, it displays in Chrome as Œ. Copy this character to your clipboard, paste it into a document, and save it as UTF-16 BE. Hex dump the file and you will see:

0000000 01 52 

That's the Unicode code point / UTF-16 encoding of 'LATIN CAPITAL LIGATURE OE'. Therefore, the Oracle SQL Developer tool is just deceiving/helping you by displaying 'LATIN CAPITAL LIGATURE OE' instead.
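The same two bytes fall out of Java directly; a short sketch reproducing the dump:

    import java.nio.charset.StandardCharsets;

    byte[] bytes = "\u0152".getBytes(StandardCharsets.UTF_16BE);
    for (byte b : bytes) {
        System.out.format("%02x ", b & 0xFF); // prints: 01 52
    }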

Upvotes: 2

krokodilko

Reputation: 36087

String.getBytes() converts a string into a sequence of bytes using the platform's default encoding. It is equivalent to:

String encoding = System.getProperty("file.encoding");
"\u008C".getBytes( encoding );

The result of this call therefore depends on the platform's default encoding.

For example, my PC uses the Cp1250 code page, and I get this result:

    System.out.println( System.getProperty("file.encoding") ); // name of the platform default encoding
    byte b[] = "\u008C".getBytes();                             // encodes using that default
    for( byte bb: b ) System.out.format("%x\n", bb);
    -------
    Cp1250
    3f

As you see, the U+008C character was converted to one byte, 3f, which in Cp1250 is the ? character. I believe this is because there is no U+008C character in Cp1250, so the CharsetEncoder (which is used by the getBytes() method to convert Unicode strings to a specific charset) converts it to ? in this case.

See here for more details: http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html
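To see the substitution as an explicit error rather than a silent ?, you can drive the encoder yourself; a sketch assuming the Cp1250 charset is installed (it normally is):

    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetEncoder;

    // A bare encoder REPORTs unmappable characters; String.getBytes() configures
    // its encoder to REPLACE them with the charset's replacement byte ('?' here).
    CharsetEncoder encoder = Charset.forName("Cp1250").newEncoder();
    try {
        encoder.encode(CharBuffer.wrap("\u008C"));
    } catch (CharacterCodingException e) {
        System.out.println(e); // java.nio.charset.UnmappableCharacterException
    }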

As you see, your Java code converts a Unicode string to bytes in your platform's default encoding, and then treats the resulting byte array as if it were UTF-8, when in fact it is encoded in some other charset.
This doesn't make sense.
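If the byte round trip is really needed, a minimal sketch of the symmetric version, naming the charset explicitly on both sides so the platform default never enters into it:

    import java.nio.charset.StandardCharsets;

    String s1 = "\u008C"; // if you only need the character, the literal alone suffices

    byte[] utf8 = s1.getBytes(StandardCharsets.UTF_8);  // two bytes: c2 8c
    String roundTripped = new String(utf8, StandardCharsets.UTF_8);
    System.out.println(roundTripped.equals(s1));        // true on every platform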

Upvotes: 1
