Reputation: 4165
I ran into a problem with Unicode character serialization and deserialization. Here is a sample program that writes a char to a file and then tries to read it back. The written and read chars (ch and ch2) are different. Any suggestions why I get this behavior?
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class MainClass {
    public static void main(String[] args) {
        try {
            File outfile = new File("test.txt");
            FileOutputStream fos = new FileOutputStream(outfile);
            OutputStreamWriter writer = new OutputStreamWriter(fos, "UTF-16");
            FileInputStream fis = new FileInputStream(outfile);
            InputStreamReader reader = new InputStreamReader(fis, "UTF-16");

            char ch = 56000;
            System.out.println(Integer.toBinaryString(ch));

            // Write the char and close the writer so it is flushed to the file.
            writer.write(ch);
            writer.close();

            // Read the char back and print its bit pattern for comparison.
            char ch2 = (char) reader.read();
            System.out.println(Integer.toBinaryString(ch2));
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
UPD: I found empirically that this happens only for values in the range 55296-57343.
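For reference, here is a minimal sketch (not part of the original program) of how that range can be confirmed without any file I/O, by round-tripping every 16-bit char value through a UTF-16 encode/decode. It assumes Java 7+ for StandardCharsets; the class name SurrogateScan is made up.

import java.nio.charset.StandardCharsets;

public class SurrogateScan {
    public static void main(String[] args) {
        // Round-trip every 16-bit char value through UTF-16 and remember
        // the first and last values that do not survive the cycle.
        int firstBad = -1, lastBad = -1;
        for (int c = 0; c <= 0xFFFF; c++) {
            String original = String.valueOf((char) c);
            byte[] bytes = original.getBytes(StandardCharsets.UTF_16);
            String decoded = new String(bytes, StandardCharsets.UTF_16);
            if (!original.equals(decoded)) {
                if (firstBad < 0) firstBad = c;
                lastBad = c;
            }
        }
        // Prints 55296 - 57343, i.e. 0xD800..0xDFFF, the surrogate range.
        System.out.println(firstBad + " - " + lastBad);
    }
}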
Upvotes: 4
Views: 174
Reputation: 1109512
Character 56000 is U+DAC0, which is not a valid Unicode character on its own: it is a high surrogate. Surrogates are only meaningful in pairs, where they point to characters outside the 16-bit-wide BMP (Basic Multilingual Plane).
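A small sketch to illustrate the point: an unpaired surrogate is malformed UTF-16, so the encoder behind OutputStreamWriter silently substitutes the replacement character U+FFFD, while a proper surrogate pair for a supplementary code point survives the round trip. The class SurrogateDemo and the roundTrip helper are illustrative, not from the question; only the Writer/Reader setup mirrors the original program.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class SurrogateDemo {
    public static void main(String[] args) throws Exception {
        char ch = 56000; // 0xDAC0, a high surrogate
        System.out.println(Character.isHighSurrogate(ch)); // true

        // Unpaired surrogate: the encoder replaces it with U+FFFD on write.
        System.out.println(Integer.toHexString(roundTrip(String.valueOf(ch)).charAt(0))); // fffd

        // A real surrogate pair (U+1D11E, MUSICAL SYMBOL G CLEF) round-trips intact.
        String clef = new String(Character.toChars(0x1D11E)); // "\uD834\uDD1E"
        System.out.println(roundTrip(clef).codePointAt(0) == 0x1D11E); // true
    }

    // Same Writer/Reader setup as in the question, but backed by a byte array.
    static String roundTrip(String s) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        OutputStreamWriter writer = new OutputStreamWriter(bos, "UTF-16");
        writer.write(s);
        writer.close();
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(bos.toByteArray()), "UTF-16");
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) sb.append((char) c);
        reader.close();
        return sb.toString();
    }
}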
Upvotes: 6