Reputation: 1
I am supposed to develop a subsystem to store certain business data in a file and I am running into a problem, but first some requirements I have:
My plan was to put everything in a String, encode it with UTF-8 (a format that will not go away any time soon), and write it to a file. The problem is that not every byte sequence is valid UTF-8, so the invalid ones are changed when I later read the file back.
Here is sample code that shows the problem:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// The charset we use to encode the strings / file
Charset charSet = StandardCharsets.UTF_8;
// The byte data we want to store (as ints here because in the app it is used as ints)
int[] idsToStore = {360, 361, 390, 391};
// We transform our ints to bytes
byte[] bytesToStore = new byte[idsToStore.length * 4];
for (int i = 0; i < idsToStore.length; i++) {
    int id = idsToStore[i];
    bytesToStore[i * 4 + 0] = (byte) ((id >> 24) & 0xFF);
    bytesToStore[i * 4 + 1] = (byte) ((id >> 16) & 0xFF);
    bytesToStore[i * 4 + 2] = (byte) ((id >> 8) & 0xFF);
    bytesToStore[i * 4 + 3] = (byte) (id & 0xFF);
}
// We transform our bytes to a String
String stringToStore = new String(bytesToStore, charSet);
System.out.println("idsToStore="+Arrays.toString(idsToStore));
System.out.println("BytesToStore="+Arrays.toString(bytesToStore));
System.out.println("StringToStore="+stringToStore);
System.out.println();
// We load our bytes from the "file" (in this case a String, but the result is the same)
byte[] bytesLoaded = stringToStore.getBytes(charSet);
// Just to check we see if the resulting String is identical
String stringLoaded = new String(bytesLoaded, charSet);
// We transform our bytes back to ints
int[] idsLoaded = new int[bytesLoaded.length / 4];
int readPos = 0;
for (int i = 0; i < idsLoaded.length; i++) {
    byte b1 = bytesLoaded[readPos++];
    byte b2 = bytesLoaded[readPos++];
    byte b3 = bytesLoaded[readPos++];
    byte b4 = bytesLoaded[readPos++];
    idsLoaded[i] = (b4 & 0xFF) | (b3 & 0xFF) << 8 | (b2 & 0xFF) << 16 | (b1 & 0xFF) << 24;
}
System.out.println("BytesLoaded="+Arrays.toString(bytesLoaded));
System.out.println("StringLoaded="+stringLoaded);
System.out.println("idsLoaded="+Arrays.toString(idsLoaded));
System.out.println();
// We check everything
System.out.println("Bytes equal: "+Arrays.equals(bytesToStore, bytesLoaded));
System.out.println("Strings equal: "+stringToStore.equals(stringLoaded));
System.out.println("IDs equal: "+Arrays.equals(idsToStore, idsLoaded));
The output with UTF-8 is:
idsToStore=[360, 361, 390, 391]
BytesToStore=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -122, 0, 0, 1, -121]
StringToStore=(can not be pasted into SO)
idsLoaded=[360, 361, 495, -1078132736, 32489405]
BytesLoaded=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -17, -65, -67, 0, 0, 1, -17, -65, -67]
StringLoaded=(can not be pasted into SO)
Bytes equal: false
Strings equal: true
IDs equal: false
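The bytes -17, -65, -67 that appear in BytesLoaded are 0xEF 0xBF 0xBD, the UTF-8 encoding of U+FFFD, the Unicode replacement character. A minimal sketch of where they come from, relying on the documented behaviour that `new String(byte[], Charset)` replaces malformed input instead of throwing; the class name `ReplacementDemo` is only illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementDemo {
    // Decodes the bytes as UTF-8 and encodes the resulting String again.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 390 as four big-endian bytes: 0x00 0x00 0x01 0x86
        byte[] raw = {0, 0, 1, (byte) 0x86};
        // 0x86 is a continuation byte without a lead byte, so it is not valid
        // UTF-8 and the decoder substitutes U+FFFD for it.
        System.out.println(Arrays.toString(roundTrip(raw)));
        // prints [0, 0, 1, -17, -65, -67]
    }
}
```

One invalid byte becomes three bytes after the round trip, which is also why the loaded array is longer than the stored one and every later ID is shifted.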
If I change the Charset to UTF-16BE (BE means big-endian), this test passes. The problem is that I am not sure whether UTF-16BE only works for this test by chance. I need to know whether it will always work or not. Or perhaps there is a better way.
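UTF-16BE does seem to pass here only because none of the sample IDs contain a surrogate code unit (0xD800 to 0xDFFF) in either 16-bit half. A hedged counterexample, assuming the same approach of decoding raw bytes with `new String(byte[], Charset)`; the class `Utf16Counterexample` and the ID 55296 (0x0000D800) are chosen purely for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16Counterexample {
    // Packs the ints as big-endian bytes (ByteBuffer's default byte order).
    static byte[] pack(int... ids) {
        ByteBuffer buf = ByteBuffer.allocate(ids.length * 4);
        for (int id : ids) {
            buf.putInt(id);
        }
        return buf.array();
    }

    // Returns true if the packed bytes survive bytes -> String -> bytes via UTF-16BE.
    static boolean survives(int... ids) {
        byte[] stored = pack(ids);
        String asText = new String(stored, StandardCharsets.UTF_16BE);
        byte[] loaded = asText.getBytes(StandardCharsets.UTF_16BE);
        return Arrays.equals(stored, loaded);
    }

    public static void main(String[] args) {
        // The IDs from the question contain no surrogate code units.
        System.out.println(survives(360, 361, 390, 391)); // prints true
        // 55296 == 0x0000D800: its low half is an unpaired high surrogate,
        // which is malformed UTF-16 and gets replaced while decoding.
        System.out.println(survives(55296, 360));         // prints false
    }
}
```

So the test in the question passes by chance: any ID whose upper or lower 16 bits form an unpaired surrogate would be altered.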
I am thankful for any recommendations. Thanks in advance.
Upvotes: 0
Views: 139
Reputation: 8803
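The only way to be sure that your charset will always work is to test it with every possible byte value: write an array of bytes containing all 256 possible values and check that it is read back unchanged. For a multi-byte charset such as UTF-8 or UTF-16, the combinations of bytes matter too, so passing this test is necessary but not sufficient.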
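A minimal sketch of such a test, assuming the bytes go through `new String(byte[], Charset)` and `getBytes(Charset)` as in the question; the class and method names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // Returns true if an array holding every byte value 0..255 once
    // is unchanged after bytes -> String -> bytes with the given charset.
    static boolean survivesAllByteValues(Charset cs) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        byte[] back = new String(all, cs).getBytes(cs);
        return Arrays.equals(all, back);
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte to exactly one char (U+0000..U+00FF).
        System.out.println(survivesAllByteValues(StandardCharsets.ISO_8859_1)); // prints true
        // UTF-8 rejects many byte sequences, e.g. stray continuation bytes.
        System.out.println(survivesAllByteValues(StandardCharsets.UTF_8));      // prints false
    }
}
```

Note that this checks each byte value only once and in one fixed order, so it does not exercise every byte combination of a multi-byte charset.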
But going back to the root of the problem, I doubt that encoding all the data into a String will work well. A String is a Unicode structure meant to hold readable text, so it is not guaranteed to preserve arbitrary binary data (for example, bytes that do not form valid characters in the chosen encoding).
Instead, I would consider a structured BINARY file. Being binary, it can hold any data transparently, and being structured, it can store several kinds of data. For example, you could design a format made of segments, each with a header giving the length of its data. The binary segments would be read through an InputStream, and the text segments through a Reader (with the desired encoding).
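One way such a segmented layout could look is sketched below. The layout (one type byte, a four-byte length, then the payload), the type tags, and all names are assumptions made for illustration, not a fixed format; for brevity the text payload is decoded with `new String(..., UTF_8)` rather than through a Reader:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SegmentFile {
    static final byte TYPE_IDS = 1;  // hypothetical tag: binary segment of ints
    static final byte TYPE_TEXT = 2; // hypothetical tag: UTF-8 text segment

    // Writes one segment: type tag, payload length, payload bytes.
    static void writeSegment(DataOutputStream out, byte type, byte[] payload) throws IOException {
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Reads one segment and checks that it has the expected type tag.
    static byte[] readSegment(DataInputStream in, byte expectedType) throws IOException {
        byte type = in.readByte();
        if (type != expectedType) {
            throw new IOException("Unexpected segment type " + type);
        }
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        return payload;
    }

    static byte[] encodeIds(int[] ids) {
        ByteBuffer buf = ByteBuffer.allocate(ids.length * 4); // big-endian by default
        for (int id : ids) {
            buf.putInt(id);
        }
        return buf.array();
    }

    static int[] decodeIds(byte[] payload) {
        IntBuffer ints = ByteBuffer.wrap(payload).asIntBuffer();
        int[] ids = new int[ints.remaining()];
        ints.get(ids);
        return ids;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for the file here.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(file)) {
            writeSegment(out, TYPE_IDS, encodeIds(new int[] {360, 361, 390, 391}));
            writeSegment(out, TYPE_TEXT, "some note".getBytes(StandardCharsets.UTF_8));
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file.toByteArray()))) {
            int[] ids = decodeIds(readSegment(in, TYPE_IDS));
            String text = new String(readSegment(in, TYPE_TEXT), StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(ids)); // prints [360, 361, 390, 391]
            System.out.println(text);                 // prints some note
        }
    }
}
```

Because the IDs never pass through a charset, no byte value can be altered; only genuine text is encoded and decoded as UTF-8.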
Upvotes: 2