Reputation: 1
I have an XML file that looks like this:
<Header>
<Type>TestType</Type>
<Owner>Me</Owner>
</Header>
ĺß™¸Ű;?źÉćáţ¬=ńgăűßEŶáCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂŢŘö¤xi¦Ö†5ÚPMáx^š‡âő
Those funny characters are binary data.
I'm having trouble parsing it. All I want to do is read the values of the Type and Owner nodes and the data after the header. That data can be large. It's basically XML with binary data appended after it. The header always starts with <Header> and ends with </Header>. The number of child nodes in it can change.
I tried simple parsing:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(f);
and what I got was:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
Upvotes: 0
Views: 118
Reputation: 13008
You could try a SAX parser instead, which does not read in the whole document. Just read elements/attributes until you have what you want, then stop.
But this is not a well-formed XML file. If possible, fix it by putting the (encoded) binary data into its own element.
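If you control how the file is written, one way to put the binary data into its own element is to Base64-encode it. The following is only a sketch under assumptions: the root element name File, the element name Data, and the class Base64Payload are made up for illustration and are not part of the original file format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class Base64Payload {
    // Wraps the header fields and the Base64-encoded payload in a single root
    // element so the result is well-formed XML.
    // Assumption: "File" and "Data" are hypothetical element names.
    // Note: type and owner are not XML-escaped here; real code should escape them.
    public static String wrap(String type, String owner, byte[] payload) {
        String encoded = Base64.getEncoder().encodeToString(payload);
        return "<File><Header><Type>" + type + "</Type><Owner>" + owner
                + "</Owner></Header><Data>" + encoded + "</Data></File>";
    }

    // Parses the document and decodes the Data element back into raw bytes.
    public static byte[] readPayload(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        String encoded = doc.getElementsByTagName("Data").item(0).getTextContent();
        return Base64.getDecoder().decode(encoded.trim());
    }

    public static void main(String[] args) throws Exception {
        // Arbitrary bytes, including ones that are not valid UTF-8.
        byte[] payload = {(byte) 0xE5, (byte) 0xDF, 0x00, 0x3C, (byte) 0xFF};
        String xml = wrap("TestType", "Me", payload);
        byte[] back = readPayload(xml);
        System.out.println("round trip ok: " + Arrays.equals(payload, back));
    }
}
```

Base64 output uses only ASCII letters, digits, +, / and =, so it is safe as element text. The trade-off is that the encoded data is roughly a third larger than the raw bytes.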
Upvotes: 0
Reputation: 78011
To be processed by an XML parser, a file must be well formed and, optionally, valid (the latter requires testing against a schema describing the expected tag format).
In this case your document is not well formed:
$ xmllint --noout File1.xml
File1.xml:5: parser error : Extra content at the end of the document
ĺß™¸Ű;?źÉćáţ¬=ńgăűßEŶáCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂ
^
I would suggest finding some way to strip away the offending characters and then process the properly formatted XML. For example, assuming the XML is in the first 4 lines of the file:
head -n 4 File1.xml | xmllint --noout -
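The same idea can be done in Java without relying on a fixed line count. Since the question says the header always ends with </Header>, you can search the raw bytes for that closing tag, hand only the bytes before it to the XML parser, and treat everything after it as the binary payload. This is a rough sketch, not a definitive implementation: the class name HeaderSplitter and its methods are hypothetical, and it reads the whole file into memory, so for very large files you would want to stream instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class HeaderSplitter {
    private static final byte[] END_TAG =
            "</Header>".getBytes(StandardCharsets.US_ASCII);

    // Returns the index just past the first "</Header>", or -1 if it is absent.
    public static int headerEnd(byte[] content) {
        outer:
        for (int i = 0; i + END_TAG.length <= content.length; i++) {
            for (int j = 0; j < END_TAG.length; j++) {
                if (content[i + j] != END_TAG[j]) {
                    continue outer;
                }
            }
            return i + END_TAG.length;
        }
        return -1;
    }

    // Parses only the first 'end' bytes, which form a well-formed XML document.
    public static Document parseHeader(byte[] content, int end) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(content, 0, end));
    }

    // Text of the first element with the given tag name.
    public static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java HeaderSplitter <file>");
            return;
        }
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        int end = headerEnd(content);
        if (end < 0) {
            throw new IllegalStateException("no </Header> found");
        }
        Document doc = parseHeader(content, end);
        // Assumption: the payload starts right after </Header>. If the writer
        // adds a newline there, skip it before using the data.
        byte[] payload = Arrays.copyOfRange(content, end, content.length);
        System.out.println("Type:  " + text(doc, "Type"));
        System.out.println("Owner: " + text(doc, "Owner"));
        System.out.println("Payload bytes: " + payload.length);
    }
}
```

Searching on bytes rather than decoded text matters here: decoding the binary tail as UTF-8 is exactly what triggers the MalformedByteSequenceException.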
Upvotes: 2