Reputation: 557
I have an XML file with a BOM (UTF-8 encoding). The file arrives as a byte[]. I need to skip the BOM and then convert these bytes into a String.
This is what my code looks like now:
BOMInputStream bomInputStream = new BOMInputStream(new ByteArrayInputStream(requestDTO.getFile())); // getFile() returns byte[]
bomInputStream.skip(bomInputStream.hasBOM() ? bomInputStream.getBOM().length() : 0);
validationService.validate(new String(/*BYTE[] WITHOUT BOM*/)); // throws NullPointerException
I'm using BOMInputStream and have a couple of issues. The first is that bomInputStream.hasBOM() returns false. The second is that I'm not sure how to retrieve the byte[] from bomInputStream afterwards, because bomInputStream.getBOM().getBytes() throws a NullPointerException. Thanks for any help!
BOMInputStream documentation link: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BOMInputStream.html
Upvotes: 0
Views: 1695
Reputation: 109547
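The constructor without the boolean include parameter excludes the BOM from the bytes the stream delivers, so a String built from that stream will not contain a BOM. That constructor also looks only for a UTF-8 BOM. When hasBOM() returns false, no BOM was detected, and getBOM() then returns null, which is where the NullPointerException comes from.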
byte[] xml = requestDTO.getFile();
int bomLength = 0;
Charset charset = StandardCharsets.UTF_8;
try (BOMInputStream bommedInputStream = new BOMInputStream(new ByteArrayInputStream(xml),
        true)) {
    if (bommedInputStream.hasBOM()) {
        bomLength = bommedInputStream.getBOM().length();
        charset = Charset.forName(bommedInputStream.getBOMCharsetName());
    } else {
        // Handle <?xml ... encoding="..." ... ?>.
        // ISO-8859-1 decodes any byte sequence, so the ASCII declaration stays readable.
        String t = new String(xml, StandardCharsets.ISO_8859_1);
        // replaceFirst takes a regex; String.replace would treat the pattern literally.
        String enc = t.replaceFirst("(?sm).*<\\?xml.*\\bencoding=\"([^\"]+)\".*\\?>.*$", "$1");
        // ... or such to fill charset ...
    }
}
String s = new String(xml, charset).replaceFirst("^\uFEFF", ""); // Remove BOM.
validationService.validate(s);
Removing the BOM could also be done by skipping the first bomLength bytes before decoding. BOMInputStream can give us the charset for the UTF variants it detects.
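Skipping by length can be sketched with the JDK alone. The class and method names below are made up for illustration, and the sketch assumes the input is UTF-8 and checks only for the three-byte UTF-8 BOM (EF BB BF); it is not how BOMInputStream works internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BomSkipSketch {
    // UTF-8 encoding of the code point U+FEFF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns 3 if data starts with the UTF-8 BOM, otherwise 0. */
    static int utf8BomLength(byte[] data) {
        if (data.length >= UTF8_BOM.length
                && Arrays.equals(Arrays.copyOfRange(data, 0, UTF8_BOM.length), UTF8_BOM)) {
            return UTF8_BOM.length;
        }
        return 0;
    }

    /** Decodes data as UTF-8, starting after the BOM if one is present. */
    static String decodeWithoutBom(byte[] data) {
        int bomLength = utf8BomLength(data);
        // The offset/length constructor avoids copying the array first.
        return new String(data, bomLength, data.length - bomLength, StandardCharsets.UTF_8);
    }
}
```

With something like this, the call in the question would become validationService.validate(BomSkipSketch.decodeWithoutBom(requestDTO.getFile())).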
The String constructor without an encoding/charset (as you used) uses the default platform encoding. As the BOM is the Unicode code point U+FEFF, you can simply pass "\uFEFF" to replaceFirst to strip it from the decoded String.
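For the branch without a BOM, the placeholder that fills charset from the XML declaration could look roughly like the following. This is only a sketch: the class name, the method name, and the 200-byte look-ahead are assumptions, and it only handles ASCII-compatible encodings, because the declaration has to be readable after decoding as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclCharsetSketch {
    // Captures the encoding name from a leading <?xml ... encoding="..." ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the charset named in the XML declaration, or fallback if none is usable. */
    static Charset declaredCharset(byte[] xml, Charset fallback) {
        // Only the start of the document is inspected (assumed limit of 200 bytes).
        int n = Math.min(xml.length, 200);
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String head = new String(xml, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return fallback;
    }
}
```

In the answer's code, the else branch could then assign charset = XmlDeclCharsetSketch.declaredCharset(xml, StandardCharsets.UTF_8), keeping UTF-8 as the default when no declaration is found.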
Upvotes: 1