Reputation: 269
I have a problem with Apache POI. I am trying to write a 16-bit character value (such as a character from CJK Unified Ideographs Extension B) to an .xlsx file. However, the cell value becomes question marks (like ????) in the generated .xlsx file.
Does anyone know how to handle 16-bit character values in Apache POI with the .xlsx format?
My POI version is 3.14.
Code sample below:
XSSFWorkbook workbook = new XSSFWorkbook();
XSSFSheet sheet = workbook.createSheet("Test");
XSSFRow row1 = sheet.createRow(0);
XSSFCell r1c1 = row1.createCell(0);
r1c1.setCellValue("𤆕𤆕𤆕"); // value of CJK Unified Ideographs Extension B
XSSFCell r1c2 = row1.createCell(1);
FileOutputStream fos = new FileOutputStream("D:/temp/test.xlsx");
workbook.write(fos);
fos.close();
Thanks!
Upvotes: 3
Views: 3029
Reputation: 61915
The problem exists, but not with 16-bit (2-byte) Unicode characters from 0x0000 to 0xFFFF. It occurs with characters that need more than 2 bytes in the UTF-16 encoding. Those are the characters that are addressed as Unicode code points in the Java Character documentation: "Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding." The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters (characters whose code points are greater than U+FFFF) are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
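To make the surrogate-pair representation concrete, here is a minimal sketch using only the JDK. The class name SurrogateDemo and the helper utf16Units are made up for illustration, and U+20000 (the first code point of CJK Unified Ideographs Extension B) is used as a sample instead of the character from the question.

```java
final class SurrogateDemo {
    /** Returns the UTF-16 code units of one code point as 4-digit hex strings. */
    static String[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        String[] hex = new String[units.length];
        for (int i = 0; i < units.length; i++) {
            hex[i] = String.format("%04X", (int) units[i]);
        }
        return hex;
    }

    public static void main(String[] args) {
        // One supplementary character, built from its code point.
        String s = new String(Character.toChars(0x20000));
        System.out.println(s.length());                            // 2 (char values)
        System.out.println(s.codePointCount(0, s.length()));       // 1 (code point)
        System.out.println(String.join(" ", utf16Units(0x20000))); // D840 DC00
    }
}
```

So a single Extension B character occupies two char values in a Java String, one from each surrogate range.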
The problem is with org.apache.xmlbeans.impl.store.Saver. This works with a private char[] _buf. But since the max value of a char is 0xFFFF, Unicode code points from 0x10000 to 0x10FFFF cannot be stored in a single char. So they will be stored as a pair of char values.
There is a method:
/**
* Test if a character is valid in xml character content. See
* http://www.w3.org/TR/REC-xml#NT-Char
*/
private boolean isBadChar ( char ch )
{
return ! (
(ch >= 0x20 && ch <= 0xD7FF ) ||
(ch >= 0xE000 && ch <= 0xFFFD) ||
(ch >= 0x10000 && ch <= 0x10FFFF) ||
(ch == 0x9) || (ch == 0xA) || (ch == 0xD)
);
}
That code is buggy, since it checks whether a char is between 0x10000 and 0x10FFFF. As mentioned, this is not possible at all.
It also treats the high-surrogates range (\uD800-\uDBFF) and the low-surrogates range (\uDC00-\uDFFF) as bad chars. So code points represented as a pair of char values will be rejected.
So the problem results from a bug in org.apache.xmlbeans.impl.store.Saver.
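The effect can be reproduced outside of xmlbeans with a small sketch. The predicate below is a copy of the method quoted above, made static so it can be called directly; the class name OriginalCheckDemo and the helper countBadChars are made up for this demo and are not part of xmlbeans.

```java
final class OriginalCheckDemo {
    // Copy of the predicate quoted above, made static for the demo.
    static boolean isBadChar(char ch) {
        return !(
            (ch >= 0x20 && ch <= 0xD7FF) ||
            (ch >= 0xE000 && ch <= 0xFFFD) ||
            (ch >= 0x10000 && ch <= 0x10FFFF) || // can never be true for a char
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
        );
    }

    /** Counts how many char values of s the predicate flags as bad. */
    static int countBadChars(String s) {
        int bad = 0;
        for (int i = 0; i < s.length(); i++) {
            if (isBadChar(s.charAt(i))) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // U+20000 is one character but two char values (a surrogate pair).
        String supplementary = new String(Character.toChars(0x20000));
        System.out.println(countBadChars(supplementary)); // 2
        System.out.println(countBadChars("abc"));         // 0
    }
}
```

Both halves of the surrogate pair are flagged, so nothing of the original character survives the check.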
Patch:
Goal: Do not exclude the high-surrogates range (\uD800-\uDBFF) and the low-surrogates range (\uDC00-\uDFFF) as bad chars. Then Unicode code points from U+10000 upward, stored as two 16-bit chars, will not be excluded from the XML.
Download Saver.java. Change the method private boolean isBadChar ( char ch ) to
/**
* Test if a character is valid in xml character content. See
* http://www.w3.org/TR/REC-xml#NT-Char
*/
private boolean isBadChar ( char ch )
{
return ! (
(ch >= 0x20 && ch <= 0xFFFD ) ||
(ch == 0x9) || (ch == 0xA) || (ch == 0xD)
);
}
in both static final class OptimizedForSpeedSaver and static final class TextSaver.
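As a quick sanity check of the patched predicate, here is a minimal sketch that again copies the method into a made-up demo class (PatchedCheckDemo is not part of xmlbeans). It also shows one limitation of this simple patch: an unpaired surrogate is accepted as well, because it falls inside the 0x20 to 0xFFFD range.

```java
final class PatchedCheckDemo {
    // Copy of the patched predicate from above, made static for the demo.
    static boolean isBadChar(char ch) {
        return !(
            (ch >= 0x20 && ch <= 0xFFFD) ||
            (ch == 0x9) || (ch == 0xA) || (ch == 0xD)
        );
    }

    public static void main(String[] args) {
        char[] pair = Character.toChars(0x20000);      // high and low surrogate
        System.out.println(isBadChar(pair[0]));        // false
        System.out.println(isBadChar(pair[1]));        // false
        System.out.println(isBadChar((char) 0xD800));  // false, even when unpaired
        System.out.println(isBadChar((char) 0xFFFE));  // true
        System.out.println(isBadChar((char) 0x0));     // true
    }
}
```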
Compile Saver.java.
Store a backup of xmlbeans-2.6.0.jar somewhere outside the classpath.
Replace Saver$OptimizedForSpeedSaver.class and Saver$TextSaver.class in xmlbeans-2.6.0.jar -> /org/apache/xmlbeans/impl/store/ with the newly compiled ones.
Now Unicode code points from U+10000 upward will be stored in sharedStrings.xml.
Disclaimer:
This is not well tested, so don't use it in production. It is only shown here to describe the problem. Maybe some programmers at xmlbeans.apache.org will find the time to solve the problem with org.apache.xmlbeans.impl.store.Saver properly.
Update: xmlbeans-2.6.2.jar is available now. It already contains the patch.
Update: xmlbeans-3.0.0.jar is available now. It also already contains a patch.
It does:
/**
* Test if a character is valid in xml character content. See
* http://www.w3.org/TR/REC-xml#NT-Char
*/
static boolean isBadChar ( char ch )
{
return ! (
Character.isHighSurrogate(ch) ||
Character.isLowSurrogate(ch) ||
(ch >= 0x20 && ch <= 0xD7FF ) ||
(ch >= 0xE000 && ch <= 0xFFFD) ||
(ch >= 0x10000 && ch <= 0x10FFFF) ||
(ch == 0x9) || (ch == 0xA) || (ch == 0xD)
);
}
So it checks whether char ch is a high surrogate or a low surrogate, and if so, it is not a bad char. OK.
But it nevertheless still checks whether char ch is greater than or equal to 0x10000. Again: this is not possible for a char! The max value of a char is 0xFFFF.
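For comparison, the 0x10000 to 0x10FFFF range only makes sense when the check works on int code points instead of char values. The following is a hypothetical sketch of such a check, following the Char production of the XML 1.0 specification linked in the comment above; it is not what xmlbeans does, and the class and method names are made up.

```java
final class XmlCodePointCheck {
    /** True if cp is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF); // reachable, because cp is an int
    }

    /** True if every code point of s is a valid XML character. */
    static boolean isValidXmlText(String s) {
        // codePoints() joins well-formed surrogate pairs into one int;
        // an unpaired surrogate is reported as its own value (0xD800-0xDFFF).
        return s.codePoints().allMatch(XmlCodePointCheck::isValidXmlCodePoint);
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x20000));
        String loneSurrogate = String.valueOf((char) 0xD840);
        System.out.println(isValidXmlText(supplementary)); // true
        System.out.println(isValidXmlText(loneSurrogate)); // false
    }
}
```

Working on code points accepts well-formed surrogate pairs and still rejects unpaired surrogates, which a per-char check cannot distinguish.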
Upvotes: 3