developer1

Reputation: 557

Skip BOM using BOMInputStream and retrieve byte[] without BOM

I have an XML file with a BOM (UTF-8 encoding). The file arrives as a byte[]. I need to skip the BOM and then convert the remaining bytes into a String.

This is what my code looks like now:

BOMInputStream bomInputStream = new BOMInputStream(new ByteArrayInputStream(requestDTO.getFile())); // getFile() returns byte[]

bomInputStream.skip(bomInputStream.hasBOM() ? bomInputStream.getBOM().length() : 0);

validationService.validate(new String(/*BYTE[] WITHOUT BOM*/)); // throws NullPointerException

I'm using BOMInputStream and have a couple of issues. The first is that bomInputStream.hasBOM() returns false. The second is that I'm not sure how to retrieve the byte[] from bomInputStream afterwards, because bomInputStream.getBOM().getBytes() throws a NullPointerException. Thanks for any help!

BOMInputStream documentation link: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/input/BOMInputStream.html

Upvotes: 0

Views: 1695

Answers (1)

Joop Eggen

Reputation: 109547

The constructor without boolean include parameter excludes the BOM, hence hasBOM() returns false, and no BOM will be included. And the String will not contain a BOM. Then getBOM() returns null!

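import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

byte[] xml = requestDTO.getFile();
int bomLength = 0;
Charset charset = StandardCharsets.UTF_8;
// include = true keeps the BOM in the stream; the listed BOMs are the ones to detect.
// hasBOM() and getBOM() can throw IOException, so the enclosing method must handle it.
try (BOMInputStream bommedInputStream = new BOMInputStream(new ByteArrayInputStream(xml),
            true, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE)) {
    if (bommedInputStream.hasBOM()) {
        bomLength = bommedInputStream.getBOM().length();
        charset = Charset.forName(bommedInputStream.getBOMCharsetName());
    } else {
        // Handle <?xml ... encoding="..." ... ?>.
        String t = new String(xml, StandardCharsets.ISO_8859_1);
        Matcher m = Pattern.compile("^\\s*<\\?xml[^>]*\\bencoding=[\"']([^\"']+)[\"']").matcher(t);
        if (m.find()) {
            charset = Charset.forName(m.group(1)); // otherwise keep the UTF-8 default
        }
    }
}
String s = new String(xml, charset).replaceFirst("^\uFEFF", ""); // Remove BOM.
validationService.validate(s);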

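Removing the BOM could also be done at the byte level using bomLength. BOMInputStream can give us the charset for the other UTF variants as well, provided the corresponding ByteOrderMark constants (e.g. ByteOrderMark.UTF_16LE) are passed to its constructor; by default only the UTF-8 BOM is detected.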
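The bomLength idea also works without any library when only the UTF-8 BOM matters. What follows is a minimal sketch, assuming the input is a byte[] like the one returned by getFile(). The class and method names (BomBytes, utf8BomLength, stripUtf8Bom) are made up for illustration, and only the UTF-8 BOM (bytes EF BB BF) is handled:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of commons-io.
final class BomBytes {
    // UTF-8 encoding of U+FEFF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns 3 if data starts with the UTF-8 BOM, otherwise 0. */
    static int utf8BomLength(byte[] data) {
        if (data.length < UTF8_BOM.length) {
            return 0;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (data[i] != UTF8_BOM[i]) {
                return 0;
            }
        }
        return UTF8_BOM.length;
    }

    /** Returns a copy of data without a leading UTF-8 BOM. */
    static byte[] stripUtf8Bom(byte[] data) {
        return Arrays.copyOfRange(data, utf8BomLength(data), data.length);
    }

    public static void main(String[] args) {
        byte[] xml = "\uFEFF<root/>".getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = stripUtf8Bom(xml);
        System.out.println(new String(withoutBom, StandardCharsets.UTF_8));
    }
}
```

This yields the byte[] without the BOM that the question asks for. If a String is all that is needed, new String(xml, bomLength, xml.length - bomLength, charset) avoids the extra copy.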

The String constructor without an encoding/charset argument (as you used) falls back to the default platform encoding, which is not necessarily UTF-8. As the BOM is the Unicode code point U+FEFF, once the bytes are decoded you can simply match it as "\uFEFF".
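As a small illustration of both points, here is a sketch that decodes with an explicit charset and then drops a leading U+FEFF. It assumes the bytes are UTF-8; BomString and decodeUtf8WithoutBom are hypothetical names chosen for this example:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class BomString {
    /** Decodes UTF-8 bytes and drops one leading U+FEFF, if present. */
    static String decodeUtf8WithoutBom(byte[] data) {
        // Explicit charset: the result does not depend on the platform default.
        String s = new String(data, StandardCharsets.UTF_8);
        // After decoding, the three BOM bytes have become the single char U+FEFF.
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        byte[] xml = "\uFEFF<root/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8WithoutBom(xml));
    }
}
```

Using startsWith and substring instead of replaceFirst avoids compiling a regular expression, but the effect is the same.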

Upvotes: 1
