Reputation: 1230
I'm trying to parse an XML document, but the parser throws an error. If I print the String with System.out.println, I can see an extra character in front of the XML declaration.
Before:
<?xml version="1.0"
After:
?<?xml version="1.0"
I tried changing the charset to UTF-8, but that didn't work. What should I do?
Upvotes: 4
Views: 6601
Reputation: 1230
For anyone who needs to parse XML and is running into problems because of a BOM, the code below worked for me.
You can use BOMInputStream from Apache Commons IO, which does the job for you. I had this problem myself, and using this API is much easier than handling the BOM by hand. A tip for parsing XML: get the content as an array of bytes, run it through BOMInputStream, and only then decode it to a String with the UTF-8 charset. That way you will not lose accented characters.
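Code that turns the source into an InputStream and strips the BOM (the charset is passed explicitly so the conversion does not depend on the platform default):
String source = FileUtil.takeOffBOM(IOUtils.toInputStream(attachment.getValue(), "UTF-8"));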
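Method that strips the BOM:
public static String takeOffBOM(InputStream inputStream) throws IOException {
    // BOMInputStream (Apache Commons IO) detects a leading UTF-8 BOM and excludes it by default.
    BOMInputStream bomInputStream = new BOMInputStream(inputStream);
    // Decode the remaining bytes as UTF-8 so accented characters survive.
    return IOUtils.toString(bomInputStream, "UTF-8");
}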
Upvotes: 3
Reputation: 1553
You have a UTF-8 string (which is why Notepad++ is recognizing it as such), but UTF-8 doesn't require a BOM. Some programs write one; some don't. This leads to occasional confusion when reading files: some readers (like the one you're using in your Java code) don't recognize the BOM and therefore don't skip it. I'd recommend something like the accepted answer to this question or this one for removing it. Rather than blindly dropping the first 3 bytes of every incoming string, check that they actually are a BOM (EF BB BF in UTF-8) before removing them.
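A minimal sketch of that check, using only the JDK. The class and method names (BomStripper, stripUtf8Bom) are made up for illustration, and the sketch assumes the XML is available as raw UTF-8 bytes rather than as an already decoded String:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class BomStripper {

    // Decodes UTF-8 bytes to a String, skipping a leading BOM only if one is present.
    static String stripUtf8Bom(byte[] data) {
        int offset = 0;
        // The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
        if (data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF) {
            offset = 3;
        }
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }
}
```

Input without a BOM passes through unchanged, so the method should be safe to apply to every incoming document.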
Upvotes: 4
Reputation: 1137
A lot of utilities produce an odd initial character like this.
You can use Java code to skip any characters before the first "<". If the XML file is your own, you can fix it for good with, for example:
vi # no filename here, we need first to get in binary mode.
:set binary
:e filename.containing.your.xml
dt<:w
:q!
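The Java approach mentioned above (skipping everything before the first "<") could be sketched as follows. XmlPrefixSkipper and skipToFirstTag are hypothetical names, and the sketch assumes the XML has already been read into a String:

```java
// Hypothetical helper for illustration; not part of any library.
final class XmlPrefixSkipper {

    // Returns the text starting at the first '<', dropping a BOM or any other leading junk.
    static String skipToFirstTag(String xml) {
        int start = xml.indexOf('<');
        // No '<' at all: return the input unchanged and let the parser report the error.
        return start < 0 ? xml : xml.substring(start);
    }
}
```

Note that this discards anything in front of the first "<", not only a BOM, so it is a blunter tool than an explicit BOM check.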
Upvotes: 1