Jean-Michel Garcia

Reputation: 2389

Encoding strangeness with Cp500 (LF & NEL)

Recently I had a strange issue with the Cp500 (EBCDIC) encoding during a transformation from bytes to String and then back from String to bytes.

The problem is that one specific byte, LINE FEED (LF, 0x25), comes back from this round trip as NEW LINE (NEL, 0x15).

The following code demonstrates this:

byte[] b25 = { 0x25 };
byte[] b4E = { 0x4E };

System.out.printf("\n0x25 in hex : <0x%02X>", b25[0]);
System.out.printf("\n0x4E in hex : <0x%02X>", b4E[0]);

String stringB25 = new String(b25, "Cp500");
String stringB4E = new String(b4E, "Cp500");

System.out.printf("\nOther way, 0x25 in hex : <0x%02X>", stringB25.getBytes("Cp500")[0]);
System.out.printf("\nOther way, 0x4E in hex : <0x%02X>", stringB4E.getBytes("Cp500")[0]);

Output :

0x25 in hex : <0x25>
0x4E in hex : <0x4E>
Other way, 0x25 in hex : <0x15>
Other way, 0x4E in hex : <0x4E>

To understand this behavior, I looked at the IBM500.java class and saw that both the 0x15 and 0x25 bytes map to the "\n" character.
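To see whether any other byte values are affected, one option is to scan all 256 single-byte inputs and report those that do not survive a decode followed by an encode. This is only a minimal sketch: the class name Cp500RoundTripScan and the method lossyBytes are made up for illustration, and the result depends on the JVM and its charset tables.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class Cp500RoundTripScan {

    /** Returns every byte value (0..255) that does not survive decode-then-encode. */
    static List<Integer> lossyBytes(Charset cs) {
        List<Integer> lossy = new ArrayList<>();
        for (int v = 0; v < 256; v++) {
            byte[] in = { (byte) v };
            // Decode the single byte to a String, then encode it again.
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length != 1 || out[0] != in[0]) {
                lossy.add(v);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        // "IBM500" is the canonical name of the Cp500 charset.
        for (int v : lossyBytes(Charset.forName("IBM500"))) {
            System.out.printf("0x%02X does not round-trip%n", v);
        }
    }
}
```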

What's the reason behind that?

Ultimately, is there a way to keep the bytes consistent across String decoding and encoding?

Upvotes: 4

Views: 5061

Answers (1)

McDowell

Reputation: 108949

Consider this code (the methods go inside a class, with java.nio.charset.Charset imported):

  public static void main(String[] args) {
    transcode();
    System.setProperty("ibm.swapLF", "true");
    transcode();
  }

  private static void transcode() {
    byte EBCDIC_NL = 0x15; //next line
    byte EBCDIC_LF = 0x25; //line feed
    byte EBCDIC_CR = 0x0D; //carriage return

    ebcdicToUtf16(EBCDIC_NL);
    ebcdicToUtf16(EBCDIC_LF);
    ebcdicToUtf16(EBCDIC_CR);

    utf16ToEbcdic("\u0085"); //next line
    utf16ToEbcdic("\n"); //line feed
    utf16ToEbcdic("\r"); //carriage return
  }

  private static void ebcdicToUtf16(byte... b) {
    String utf16 = new String(b, Charset.forName("IBM500"));
    System.out.format("%02x -> %04x%n", b[0] & 0xFF, utf16.charAt(0) & 0xFFFF);
  }

  private static void utf16ToEbcdic(String s) {
    byte[] b = s.getBytes(Charset.forName("IBM500"));
    System.out.format("%04x -> %02x%n", s.charAt(0) & 0xFFFF, b[0] & 0xFF);
  }

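When run on an IBM JVM (1.7) this will emit the following (the first six lines come from the default setting, the last six from the run with ibm.swapLF=true):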

15 -> 000a
25 -> 000a
0d -> 000d
0085 -> 15
000a -> 15
000d -> 0d
15 -> 000a
25 -> 000a
0d -> 000d
0085 -> 15
000a -> 25
000d -> 0d

The description of the IBM JVM patch (APAR) SI23602 explains:

ADDITIONAL BACKGROUND: There are two standards in the industry for EBCDIC handling of the newline function. The two standards are to use the LF (0x25) CDRA or NL (0x15) MVS open edition. Early versions of Java (up through JDK 1.3) are inconsistent in their use of the newline function with most EBCDIC encodings using the 0x15, while some others used the 0x25. IBM JDKs, beginning in JDK 1.4, have chosen to standardize all EBCDIC character encodings on the use of NL (0x15).

To address the dual standard being used for the newline function, this APAR will provide a switch that allows certain EBCDIC converters to swap between use of 0x15 or 0x25 as the newline function. The default behavior for all EBCDIC character encodings will remain to map the unicode \u000A character to EBCDIC 0x15 character. Specifying the java property "ibm.swapLF=true" will cause the converters to switch its mapping of unicode \u000A to EBCDIC 0x25. The converters which support this java property as a switch are: Cp284, Cp285, Cp500, Cp1140, Cp1141, Cp1142, Cp1143, Cp1144, Cp1145, Cp1146, Cp1147, Cp1148, Cp1149.

Neither setting decodes any EBCDIC byte to U+0085 (the assigned Unicode value for NL/NEL); both 0x15 and 0x25 decode to U+000A. Presumably this is for historical reasons: ASCII does not have a NEL character, and EBCDIC-to-ASCII conversion must have been relatively common.

It would be possible to implement a lossless round trip to a Unicode encoding, but it is unlikely that the commonly available encoders do this.
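One way such a lossless round trip could be sketched is to wrap the stock IBM500 charset and handle the two newline bytes by hand, decoding 0x15 to U+0085 and 0x25 to U+000A, and reversing that on encode. The class name LosslessCp500 and its decode/encode methods are hypothetical, not part of any JDK API, and the sketch assumes that no other byte value needs special handling.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

class LosslessCp500 {
    private static final Charset CP500 = Charset.forName("IBM500");
    private static final byte EBCDIC_NL = 0x15; // next line
    private static final byte EBCDIC_LF = 0x25; // line feed

    /** Decodes EBCDIC bytes, keeping NL and LF as distinct characters. */
    static String decode(byte[] ebcdic) {
        // IBM500 is a single-byte charset, so byte i corresponds to char i.
        char[] chars = new String(ebcdic, CP500).toCharArray();
        for (int i = 0; i < ebcdic.length; i++) {
            if (ebcdic[i] == EBCDIC_NL) {
                chars[i] = '\u0085';
            } else if (ebcdic[i] == EBCDIC_LF) {
                chars[i] = '\n';
            }
        }
        return new String(chars);
    }

    /** Encodes text, sending NEL to 0x15 and line feed to 0x25. */
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u0085') {
                out.write(EBCDIC_NL);
            } else if (c == '\n') {
                out.write(EBCDIC_LF);
            } else {
                // Everything else goes through the stock converter.
                byte[] enc = String.valueOf(c).getBytes(CP500);
                out.write(enc, 0, enc.length);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] in = { 0x15, 0x25, 0x0D, (byte) 0xC1 };
        for (byte b : encode(decode(in))) {
            System.out.printf("%02x ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The trade-off is that text decoded this way contains U+0085 wherever the input had 0x15, so code that only treats "\n" as a line break will not see those positions as line ends.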

Upvotes: 4
