Brian

Reputation: 1387

Platform-dependent encoding issues in Java

I noticed this behavior while troubleshooting a file generation issue in a piece of Java code that was moved from AIX to a Linux server.

Charset.defaultCharset();

returns ISO-8859-1 on AIX, UTF-8 on Linux, and windows-1252 on my Windows 7. With that said, I am trying to figure out why on the Linux machine, nlength = 24 (3 bytes per alphanumeric character) whereas on AIX and Windows it is 8.

    String inString = "ABC12345";
    byte[] ebcdicByte = new byte[inString.length()];
    System.out.println("Length:"+inString.getBytes("Cp1047").length);
    ebcdicByte = inString.getBytes("Cp1047");
    String ebcdicString = new String( ebcdicByte);
    int nlength = ebcdicString.getBytes().length;
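
For reference, here is a minimal, self-contained version of the above that also prints the default charset (the class name CharsetCheck is just something I made up for this post):

    import java.nio.charset.Charset;

    public class CharsetCheck {
        public static void main(String[] args) throws Exception {
            // ISO-8859-1 on the AIX box, UTF-8 on the Linux box, windows-1252 on Windows 7
            System.out.println("Default charset: " + Charset.defaultCharset());

            String inString = "ABC12345";
            byte[] ebcdicByte = inString.getBytes("Cp1047");   // EBCDIC: C1 C2 C3 F1 F2 F3 F4 F5
            System.out.println("Length:" + ebcdicByte.length); // 8 everywhere

            String ebcdicString = new String(ebcdicByte);      // decodes with the DEFAULT charset
            int nlength = ebcdicString.getBytes().length;      // encodes with the DEFAULT charset
            System.out.println("nlength:" + nlength);          // 8 on AIX/Windows, 24 on Linux
        }
    }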

Upvotes: 0

Views: 1266

Answers (3)

VGR

Reputation: 44335

Building on fge's answer...

Your observation is occurring because new String(ebcdicByte) and ebcdicString.getBytes() use the platform's default charset.

ISO-8859-1 and windows-1252 are one-byte charsets. In those charsets, one byte always represents one character. So on AIX and Windows, when you do new String(ebcdicByte), you will always get a String whose character count is identical to your byte array's length. Similarly, converting a String back to bytes will use a one-to-one mapping.
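
A quick sketch to illustrate (ISO-8859-1 stands in here for any of the one-byte charsets):

    byte[] ebcdic = "ABC12345".getBytes("Cp1047");           // C1 C2 C3 F1 F2 F3 F4 F5
    String s = new String(ebcdic, "ISO-8859-1");             // 8 chars, one per byte
    byte[] back = s.getBytes("ISO-8859-1");                  // the same 8 bytes again
    System.out.println(s.length() + " / " + back.length);    // prints 8 / 8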

But in UTF-8, one character does not necessarily correspond to one byte. In UTF-8, bytes 0 through 127 are single-byte representations of characters, but all other values are part of a multi-byte sequence.

However, not just any sequence of bytes with their high bit set is a valid UTF-8 sequence. If you give a UTF-8 decoder a sequence of bytes that isn't properly encoded UTF-8, it is considered malformed. new String will simply map malformed sequences to a special default character, usually "�" ('\ufffd'). That behavior can be changed by explicitly creating your own CharsetDecoder and calling its onMalformedInput method, rather than relying on new String(byte[]).
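
A rough sketch of that stricter approach (the class name StrictDecode is just for illustration; CodingErrorAction.REPORT makes the decoder throw instead of substituting '\ufffd'):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class StrictDecode {
        public static void main(String[] args) throws Exception {
            byte[] ebcdicByte = "ABC12345".getBytes("Cp1047");   // C1 C2 C3 F1 F2 F3 F4 F5

            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(ebcdicByte));
            } catch (CharacterCodingException e) {
                System.out.println("Not valid UTF-8: " + e);     // this is what happens here
            }
        }
    }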

So, the ebcdicByte array contains this EBCDIC representation of "ABC12345":

C1 C2 C3 F1 F2 F3 F4 F5

None of those are valid UTF-8 byte sequences, so ebcdicString ends up as "\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd" which is "��������".

Your last line of code calls ebcdicString.getBytes(), which again does not specify a character set, so the default charset is used. Under UTF-8, "�" is encoded as three bytes, EF BF BD. Since there are eight of those characters in ebcdicString, you get 3×8=24 bytes.
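
You can see those three bytes directly with a quick sketch (assumes Java 7+ for StandardCharsets):

    byte[] b = "\ufffd".getBytes(java.nio.charset.StandardCharsets.UTF_8);
    for (byte x : b) {
        System.out.printf("%02X ", x & 0xFF);                  // prints: EF BF BD
    }
    System.out.println("(" + b.length + " bytes per replacement character)");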

Upvotes: 2

wgitscht

Reputation: 2776

You have to specify the charset in the second-to-last line:

    String ebcdicString = new String(ebcdicByte, "Cp1047");

As already pointed out, you always have to specify the charset when encoding and decoding.
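
Putting it together, the snippet from the question with a charset specified on every conversion would look roughly like this (UTF-8 in the last line is just my choice, to make the count platform-independent):

    String inString = "ABC12345";
    byte[] ebcdicByte = inString.getBytes("Cp1047");            // encode to EBCDIC
    String ebcdicString = new String(ebcdicByte, "Cp1047");     // decode with the same charset
    int nlength = ebcdicString.getBytes("UTF-8").length;        // 8 on every platform
    System.out.println("Length:" + nlength);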

Upvotes: 1

fge

Reputation: 121730

You are misunderstanding things.

This is Java.

There are bytes. There are chars. And there is the default encoding.

When translating from bytes to chars, you have to decode.

When translating from chars to bytes, you have to encode.

And of course, apart from very limited charsets, you will never have a one-to-one char-byte mapping.

If you see problems with encoding/decoding, the cause is pretty simple: somewhere in your code (with luck, in only one place; if not lucky, in several places) you failed to specify the charset to use when decoding and encoding.

Also note that by default, the encoding/decoding behaviour on failure is to replace unmappable char/byte sequences.

All this to say: a String does not have an encoding. Sure, it is a series of chars and a char is a primitive type; but it could just as well have been a stream of carrier pigeons. The two basic processes remain the same: you need to decode from bytes and you need to encode to bytes; if either part fails, you end up with meaningless byte sequences/mutant carrier pigeons.
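
In code, the two directions look like this (a minimal sketch; the charset names are only examples):

    Charset ebcdic = Charset.forName("Cp1047");        // java.nio.charset.Charset
    Charset utf8   = Charset.forName("UTF-8");

    byte[] bytes = "ABC12345".getBytes(ebcdic);        // encode: chars -> bytes
    String chars = new String(bytes, ebcdic);          // decode: bytes -> chars, same charset
    byte[] again = chars.getBytes(utf8);               // re-encode in another charset if needed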

Upvotes: 4
