Reputation: 1387
I noticed this behavior while troubleshooting a file generation issue in a piece of Java code that was moved from an AIX server to a Linux server.
Charset.defaultCharset() returns ISO-8859-1 on AIX, UTF-8 on Linux, and windows-1252 on my Windows 7 machine. With that said, I am trying to figure out why, on the Linux machine, nlength = 24 (3 bytes per alphanumeric character), whereas on AIX and Windows it is 8.
String inString = "ABC12345";
byte[] ebcdicByte = new byte[inString.length()];
System.out.println("Length:"+inString.getBytes("Cp1047").length);
ebcdicByte = inString.getBytes("Cp1047");
String ebcdicString = new String( ebcdicByte);
int nlength = ebcdicString.getBytes().length;
Upvotes: 0
Views: 1266
Reputation: 44335
Building on fge's answer...
Your observation occurs because new String(ebcdicByte) and ebcdicString.getBytes() both use the platform's default charset.
ISO-8859-1 and windows-1252 are one-byte charsets. In those charsets, one byte always represents one character. So on AIX and Windows, when you do new String(ebcdicByte), you will always get a String whose character count is identical to your byte array's length. Similarly, converting a String back to bytes will use a one-to-one mapping.
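For example, a quick sketch of that one-to-one behaviour (assuming the Cp1047 charset is available in your JRE, and naming ISO-8859-1 explicitly rather than relying on the default):
import java.nio.charset.StandardCharsets;

public class OneByteCharsetDemo {
    public static void main(String[] args) throws Exception {
        byte[] ebcdicByte = "ABC12345".getBytes("Cp1047");
        // With a one-byte charset, byte count and character count always match.
        String decoded = new String(ebcdicByte, StandardCharsets.ISO_8859_1);
        System.out.println(ebcdicByte.length);                                    // 8
        System.out.println(decoded.length());                                     // 8
        System.out.println(decoded.getBytes(StandardCharsets.ISO_8859_1).length); // 8
    }
}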
But in UTF-8, one character does not necessarily correspond to one byte. In UTF-8, bytes 0 through 127 are single-byte representations of characters, but all other values are part of a multi-byte sequence.
However, not just any sequence of bytes with the high bit set is a valid UTF-8 sequence. If you give a UTF-8 decoder a byte sequence that isn't properly encoded UTF-8, it is considered malformed. new String will simply map malformed sequences to a special default character, usually "�" ('\ufffd'). That behavior can be changed by explicitly creating your own CharsetDecoder and calling its onMalformedInput method, rather than just relying on new String(byte[]).
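If you want the failure to be loud instead of silent, a rough sketch of that approach (variable names are just for illustration, and it assumes the ebcdicByte array from your question):
import java.nio.ByteBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
// decode() now throws CharacterCodingException on the EBCDIC bytes
// instead of quietly substituting U+FFFD replacement characters.
String strict = decoder.decode(ByteBuffer.wrap(ebcdicByte)).toString();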
So the ebcdicByte array contains this EBCDIC representation of "ABC12345":
C1 C2 C3 F1 F2 F3 F4 F5
None of those form valid UTF-8 byte sequences, so ebcdicString ends up as "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd", which is "��������".
Your last line of code calls ebcdicString.getBytes(), which again does not specify a charset, so the default charset is used. In UTF-8, "�" is encoded as three bytes, EF BF BD. Since there are eight of those characters in ebcdicString, you get 3×8 = 24 bytes.
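You can check those numbers directly with a small sketch (assuming UTF-8 as the default charset, as on your Linux box):
import java.nio.charset.StandardCharsets;

byte[] replacement = "\ufffd".getBytes(StandardCharsets.UTF_8);
System.out.printf("%02X %02X %02X%n",
        replacement[0] & 0xFF, replacement[1] & 0xFF, replacement[2] & 0xFF); // EF BF BD
System.out.println(8 * replacement.length);                                   // 24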
Upvotes: 2
Reputation: 2776
You have to specify the charset in the second-to-last line:
String ebcdicString = new String(ebcdicByte, "Cp1047");
As already pointed out, you always have to specify the charset when encoding or decoding.
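Put together, a sketch of the whole round trip with the charset spelled out on both sides (assuming Cp1047 is available in your JRE):
String inString = "ABC12345";
byte[] ebcdicByte = inString.getBytes("Cp1047");           // encode: chars -> EBCDIC bytes
String ebcdicString = new String(ebcdicByte, "Cp1047");    // decode: EBCDIC bytes -> chars
int nlength = ebcdicString.getBytes("Cp1047").length;      // 8, on every platform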
Upvotes: 1
Reputation: 121730
You are misunderstanding things.
This is Java.
There are bytes. There are chars. And there is the default encoding.
When translating from bytes to chars, you have to decode.
When translating from chars to bytes, you have to encode.
And of course, apart from very limited charsets you will never have a 1-1 char-byte mapping.
If you see problems with encoding/decoding, the cause is pretty simple: somewhere in your code (with luck, in only one place; if not lucky, in several places) you failed to specify the charset to use when decoding and encoding.
Also note that by default, the encoding/decoding behaviour on failure is to replace unmappable char/byte sequences.
All this to say: a String does not have an encoding. Sure, it is a series of chars, and a char is a primitive type; but it could just as well have been a stream of carrier pigeons, and the two basic processes remain the same: you need to decode from bytes and you need to encode to bytes; if either part fails, you end up with meaningless byte sequences/mutant carrier pigeons.
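To make the two directions concrete, a minimal sketch that names the charset explicitly on both sides (US-ASCII here is just an example, not a recommendation):
import java.nio.charset.StandardCharsets;

String text = "ABC12345";
byte[] encoded = text.getBytes(StandardCharsets.US_ASCII);         // encode: chars -> bytes
String decoded = new String(encoded, StandardCharsets.US_ASCII);   // decode: bytes -> chars
System.out.println(text.equals(decoded));                          // true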
Upvotes: 4