Reputation: 446
I have the following Unicode difference between an Oracle database and Java.
If I run the following in Oracle SQL Developer:
select unistr('\008C') from dual;
I get the following Unicode character: http://www.utf8icons.com/character/140/control-character
However, if I try to perform the same kind of Unicode code point to String conversion in Java:
String s1 = new String("\u008C".getBytes(), "UTF-8");
I get an empty character as a result.
I understand I could use the \u0152 character, which displays the character I need properly in both Java and Oracle, but I would like to understand why there is this difference. I tried playing with my fonts, but that did not lead to any decent result. Thanks.
Upvotes: 3
Views: 1855
Reputation: 27704
This makes no sense:
String s1 = new String("\u008C".getBytes(), "UTF-8");
If you're lucky, your default encoding will be UTF-8 and you'll get:
s1.equals("\u008C") == true
This is because .getBytes() defaults to your system encoding. You're effectively encoding to an unknown (but discoverable) encoding and decoding from UTF-8.
If you're unlucky, your default encoding will be something else and you'll have turned your string into mojibake.
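For reference, here is a minimal sketch of a round trip that always works, because it encodes and decodes with the same explicit charset (StandardCharsets is available since Java 7):

import java.nio.charset.StandardCharsets;

// Encode and decode with the SAME explicit charset; this round-trips on every platform.
byte[] utf8 = "\u008C".getBytes(StandardCharsets.UTF_8);
String s1 = new String(utf8, StandardCharsets.UTF_8);
System.out.println(s1.equals("\u008C")); // true, regardless of the default encoding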
If what you meant to say was:
System.out.println( "\u008C" );
produces nothing, it's because U+008C 'PARTIAL LINE BACKWARD' is a control character, i.e. it's non-printing. It should never be printed. It would seem that some UIs automatically render this character as 'LATIN CAPITAL LIGATURE OE' (U+0152); that behaviour is implementation-dependent.
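You can verify this from Java itself; a small sketch using the Character class:

// U+008C is a C1 control character ('PARTIAL LINE BACKWARD')
System.out.println(Character.isISOControl('\u008C'));                 // true
System.out.println(Character.getType('\u008C') == Character.CONTROL); // true
// U+0152 ('LATIN CAPITAL LIGATURE OE') is a printable letter, not a control
System.out.println(Character.isISOControl('\u0152'));                 // false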
For example, if I create an HTML document with U+008C in it, it displays in Chrome as Œ. Copy this character into your clipboard, paste it into a document, and save it as UTF-16 BE. Hex dump the file and you will see:
0000000 01 52
That is both the Unicode code point and the UTF-16 BE encoding of 'LATIN CAPITAL LIGATURE OE'. Therefore, the Oracle SQL Developer tool is just deceiving/helping you by displaying 'LATIN CAPITAL LIGATURE OE' instead.
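You can reproduce that hex dump from Java as well; a short sketch, assuming UTF-16 BE without a byte order mark:

import java.nio.charset.StandardCharsets;

// 'LATIN CAPITAL LIGATURE OE' encoded as UTF-16 big-endian
byte[] bytes = "\u0152".getBytes(StandardCharsets.UTF_16BE);
for (byte b : bytes) System.out.format("%02x ", b); // prints: 01 52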
Upvotes: 2
Reputation: 36087
String.getBytes()
converts a string into a sequence of bytes using the platform's default encoding. It is equivalent to:
String encoding = System.getProperty("file.encoding");
"\u008C".getBytes( encoding );
The result of this call depends on the default encoding of your platform.
For example, my PC uses the Cp1250 code page, and I get this result:
System.out.println(System.getProperty("file.encoding"));
byte[] b = "\u008C".getBytes();
for (byte bb : b) System.out.format("%x\n", bb);
-------
Cp1250
3f
As you can see, the \u008C character was converted to a single byte, 3f, which in Cp1250 is the ? character. I believe this is because there is no U+008C character in Cp1250, so the CharsetEncoder (which is used by the getBytes() method to convert Unicode strings to a specific charset) replaces it with ? in this case.
See here for more details: http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html
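Here is a small sketch of that replacement behaviour using CharsetEncoder directly; getBytes() configures the encoder the same way, and the encoder's default replacement byte is 0x3f, i.e. ?:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

CharsetEncoder enc = Charset.forName("windows-1250").newEncoder()
        // replace unmappable characters instead of throwing an exception
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
ByteBuffer out = enc.encode(CharBuffer.wrap("\u008C"));
while (out.hasRemaining()) System.out.format("%x%n", out.get()); // 3f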
As you can see, your Java code converts a Unicode string to your platform's encoding, and then the result (a byte array) is decoded as if it were UTF-8, when in fact it was encoded with some other encoding.
This doesn't make sense.
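If the goal is just to have that character in a String, no encode/decode round trip is needed at all; a sketch:

// A Java String already holds UTF-16 chars; no conversion is necessary.
String s1 = "\u008C"; // the (non-printing) control character itself
String s2 = "\u0152"; // 'LATIN CAPITAL LIGATURE OE', the printable glyph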
Upvotes: 1