Blubb Blubbington

Reputation: 1

Storing plain text and byte information in the same file - Conversion problems

I am supposed to develop a subsystem to store certain business data in a file and I am running into a problem, but first some requirements I have:

I thought I would just put everything in a String, encode it as UTF-8 (a format that will not go away any time soon), and write it to a file. The problem is that not every byte sequence is valid UTF-8, so some of my bytes are changed when I later read the file again.

Here is some sample code that shows the problem:

    // The charset we use to encode the strings / file
    Charset charSet = StandardCharsets.UTF_8;

    // The byte data we want to store (as ints here because in the app it is used as ints)
    int[] idsToStore = {360, 361, 390, 391};

    // We transform our ints to bytes
    byte[] bytesToStore = new byte[idsToStore.length * 4];
    for (int i = 0; i < idsToStore.length; i++) {
        int id = idsToStore[i];
        bytesToStore[i * 4 + 0] = (byte) ((id >> 24) & 0xFF);
        bytesToStore[i * 4 + 1] = (byte) ((id >> 16) & 0xFF);
        bytesToStore[i * 4 + 2] = (byte) ((id >> 8) & 0xFF);
        bytesToStore[i * 4 + 3] = (byte) (id & 0xFF);
    }
    // We transform our bytes to a String
    String stringToStore = new String(bytesToStore, charSet);

    System.out.println("idsToStore="+Arrays.toString(idsToStore));
    System.out.println("BytesToStore="+Arrays.toString(bytesToStore));
    System.out.println("StringToStore="+stringToStore);
    System.out.println();

    // We load our bytes from the "file" (in this case a String, but it's the same result)
    byte[] bytesLoaded = stringToStore.getBytes(charSet);
    // Just to check we see if the resulting String is identical
    String stringLoaded = new String(bytesLoaded, charSet);

    // We transform our bytes back to ints
    int[] idsLoaded = new int[bytesLoaded.length / 4];
    int readPos = 0;
    for (int i = 0; i < idsLoaded.length; i++) {
        byte b1 = bytesLoaded[readPos++];
        byte b2 = bytesLoaded[readPos++];
        byte b3 = bytesLoaded[readPos++];
        byte b4 = bytesLoaded[readPos++];
        idsLoaded[i] = (b4 & 0xFF) | (b3 & 0xFF) << 8 | (b2 & 0xFF) << 16 | (b1 & 0xFF) << 24;
    }

    System.out.println("BytesLoaded="+Arrays.toString(bytesLoaded));
    System.out.println("StringLoaded="+stringLoaded);
    System.out.println("idsLoaded="+Arrays.toString(idsLoaded));
    System.out.println();

    // We check everything
    System.out.println("Bytes equal: "+Arrays.equals(bytesToStore, bytesLoaded));
    System.out.println("Strings equal: "+stringToStore.equals(stringLoaded));
    System.out.println("IDs equal: "+Arrays.equals(idsToStore, idsLoaded));

The output with UTF-8 is:

    idsToStore=[360, 361, 390, 391]
    BytesToStore=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -122, 0, 0, 1, -121]
    StringToStore=(can not be pasted into SO)

    idsLoaded=[360, 361, 495, -1078132736, 32489405]
    BytesLoaded=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -17, -65, -67, 0, 0, 1, -17, -65, -67]
    StringLoaded=(can not be pasted into SO)

    Bytes equal: false
    Strings equal: true
    IDs equal: false

If I change the Charset to UTF-16BE (BE stands for big-endian), this test works! The problem is that I am not sure whether UTF-16BE works for this test only "by chance". I need to know whether it will always work. Or perhaps there is a better way.

I am thankful for any recommendations. Thanks in advance.

Upvotes: 0

Views: 139

Answers (1)

Little Santi

Reputation: 8803

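The only way to be sure that your charset will always work is to test it against every possible byte value: write an array of bytes containing all 256 possible values, and check that it is read back correctly. For a multi-byte encoding such as UTF-16BE, also test byte combinations, because validity there depends on pairs of bytes rather than on single bytes.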
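
A quick round-trip check along these lines might look like the sketch below. The class name `RoundTripCheck` and its helper methods are made up for illustration. The value `0xD800` is chosen because, as a 16-bit unit on its own, it is an unpaired surrogate in UTF-16. The result suggests that UTF-16BE passes the test in the question only because none of the sample IDs happens to hit such a value.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original question.
class RoundTripCheck {

    // True if decoding the bytes into a String and encoding that String
    // again with the same charset yields exactly the original bytes.
    static boolean roundTrips(byte[] data, Charset cs) {
        return Arrays.equals(data, new String(data, cs).getBytes(cs));
    }

    // Big-endian bytes of one int, the same layout the question uses.
    static byte[] toBytes(int id) {
        return new byte[] {
            (byte) (id >>> 24), (byte) (id >>> 16), (byte) (id >>> 8), (byte) id
        };
    }

    public static void main(String[] args) {
        byte[] sampleId = toBytes(391);       // 00 00 01 87
        byte[] surrogateId = toBytes(0xD800); // 00 00 D8 00

        // 0x87 is not a valid stand-alone byte in UTF-8.
        System.out.println(roundTrips(sampleId, StandardCharsets.UTF_8));       // false
        // 0x0000 and 0x0187 are both ordinary UTF-16 code units.
        System.out.println(roundTrips(sampleId, StandardCharsets.UTF_16BE));    // true
        // 0xD800 is a high surrogate with no low surrogate after it.
        System.out.println(roundTrips(surrogateId, StandardCharsets.UTF_16BE)); // false
    }
}
```

So an ID such as 55296 (`0xD800`) would be corrupted by the UTF-16BE approach, even though 360, 361, 390 and 391 survive it.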

But, going back to the root of the problem, I doubt that encoding all the data into a String will work well. A String is a Unicode structure intended for readable text, so arbitrary bytes may not survive being decoded into one: byte sequences that are invalid in the chosen encoding get replaced.

Instead, I would think of a binary structured file. Being binary, it can contain anything transparently. Being structured, it can hold several kinds of data. For example, you could design a structure made of segments, each segment having a header that gives the length of its data. The binary segments would be read through an InputStream, and the text segments through a Reader (with the desired encoding).
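
As a rough illustration of such a segmented layout, here is a minimal in-memory sketch. Everything in it is an assumption made for the example: the class name `SegmentedFileSketch`, the two type tags, and the header of one type byte plus a four-byte length. It uses a `ByteBuffer` instead of streams to keep the example short. The text segment is still decoded with an explicit charset, while the ID segment is never turned into a String at all.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical layout: [type: 1 byte][length: 4 bytes][payload] per segment.
class SegmentedFileSketch {
    static final byte TYPE_TEXT = 1; // assumed tag for UTF-8 text
    static final byte TYPE_IDS = 2;  // assumed tag for raw big-endian ints
    static final int HEADER_SIZE = 1 + Integer.BYTES;

    // Builds the file content: one text segment followed by one ID segment.
    static byte[] write(String text, int[] ids) {
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        int idBytes = ids.length * Integer.BYTES;
        ByteBuffer buf = ByteBuffer.allocate(2 * HEADER_SIZE + textBytes.length + idBytes);
        buf.put(TYPE_TEXT).putInt(textBytes.length).put(textBytes);
        buf.put(TYPE_IDS).putInt(idBytes);
        for (int id : ids) {
            buf.putInt(id); // ByteBuffer is big-endian by default
        }
        return buf.array();
    }

    // Walks the segments and returns the payload of the first one with the wanted type.
    static byte[] readSegment(byte[] file, byte wantedType) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        while (buf.remaining() >= HEADER_SIZE) {
            byte type = buf.get();
            byte[] payload = new byte[buf.getInt()];
            buf.get(payload);
            if (type == wantedType) {
                return payload;
            }
        }
        throw new IllegalArgumentException("no segment of type " + wantedType);
    }

    // Only the text segment is ever decoded with a charset.
    static String readText(byte[] file) {
        return new String(readSegment(file, TYPE_TEXT), StandardCharsets.UTF_8);
    }

    // The ID segment is read as raw bytes and converted straight back to ints.
    static int[] readIds(byte[] file) {
        ByteBuffer payload = ByteBuffer.wrap(readSegment(file, TYPE_IDS));
        int[] ids = new int[payload.remaining() / Integer.BYTES];
        payload.asIntBuffer().get(ids);
        return ids;
    }

    public static void main(String[] args) {
        byte[] file = write("M\u00FCller GmbH", new int[] {360, 361, 390, 391});
        System.out.println(readText(file));
        System.out.println(java.util.Arrays.toString(readIds(file)));
    }
}
```

The resulting byte array could be persisted with something like `Files.write(path, file)`. A real format would probably also want a version marker and some validation of the length fields, which this sketch leaves out.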

Upvotes: 2
