Reputation: 1

Java parsing xml file with appended data

I have an XML file that looks like this:

<Header>
  <Type>TestType</Type>
  <Owner>Me</Owner>
</Header>
ĺß™¸Ű;?źÉćĂˇţ¬=ńgăűßEĹ¶áCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂŢŘö¤xi¦Ö†5ÚPMáx^š‡âő

Those funny letters are binary coded data.

I'm having trouble parsing it. All I want to do is read the values of the Type and Owner nodes and the data after Header. That data can be big. It's basically XML with data appended after it. Header always starts with <Header> and ends with </Header>. The number of child nodes in it can change.

I tried just simple parsing:

DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(f);

and what I got was:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.

Upvotes: 0

Answers (2)

Beryllium

Reputation: 13008

You could try a SAX parser instead, which does not read in the whole document. Just read in elements/attributes until you have what you want, then stop.

But this is not a well-formed XML file. If possible, fix it by putting the (encoded) binary data into its own element.
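
If you control the file format, one way to do that is to Base64-encode the bytes. The sketch below is only an illustration: the <File> root and <Data> element names are made up, and it assumes the Type and Owner values contain no XML special characters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class EmbeddedPayload {
    /** Builds a well-formed document with the payload Base64-encoded in a hypothetical <Data> element. */
    static String wrap(String type, String owner, byte[] payload) {
        // The Base64 alphabet (A-Z, a-z, 0-9, +, /, =) is safe inside XML text.
        String encoded = Base64.getEncoder().encodeToString(payload);
        // Assumption: type and owner contain no '<', '>' or '&', so no escaping is done here.
        return "<File><Header><Type>" + type + "</Type><Owner>" + owner + "</Owner></Header>"
                + "<Data>" + encoded + "</Data></File>";
    }

    /** Parses the document and decodes the payload back into raw bytes. */
    static byte[] unwrap(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        String encoded = doc.getElementsByTagName("Data").item(0).getTextContent();
        return Base64.getDecoder().decode(encoded);
    }
}
```

Keep in mind that Base64 makes the payload about a third larger, which matters if the data is big.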

Upvotes: 0

Mark O'Connor

Reputation: 78011

In order to be processed by an XML parser, a file must be well-formed and, optionally, valid (the latter requires testing against a "schema" describing the expected tag format).

In this case your document is not well formed:

$ xmllint --noout File1.xml
File1.xml:5: parser error : Extra content at the end of the document
ĺß™¸Ű;?źÉćĂˇţ¬=ńgăűßEĹ¶áCórýjąŞŢđ·I_§Ä†ÉD¤ďsĂ
^

I would suggest finding some way to strip away the offending characters and then process the properly formatted XML. For example, assuming the XML is in the first 4 lines of the file:

head -n 4 File1.xml | xmllint --noout -
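
The same idea can be sketched in Java, since the question says the number of header lines can change. This is a rough sketch, not a definitive implementation: the HeaderSplitter class and its method names are made up for illustration, and it assumes the whole file has already been read into a byte array (for example with Files.readAllBytes), that the header is in an ASCII-compatible encoding such as UTF-8, and that the closing tag appears literally as </Header>.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class HeaderSplitter {
    // Closing tag that marks the end of the XML part (taken from the question's sample).
    private static final byte[] END_TAG = "</Header>".getBytes(StandardCharsets.US_ASCII);

    /** Returns the index just past the first "</Header>", or -1 if it is absent. */
    static int endOfHeader(byte[] content) {
        outer:
        for (int i = 0; i <= content.length - END_TAG.length; i++) {
            for (int j = 0; j < END_TAG.length; j++) {
                if (content[i + j] != END_TAG[j]) continue outer;
            }
            return i + END_TAG.length;
        }
        return -1;
    }

    /** Parses only the bytes before headerEnd as XML, so the parser never sees the binary tail. */
    static Document parseHeader(byte[] content, int headerEnd) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(content, 0, headerEnd));
    }

    /** Returns the raw bytes after the header, untouched by any character decoding. */
    static byte[] payload(byte[] content, int headerEnd) {
        return Arrays.copyOfRange(content, headerEnd, content.length);
    }
}
```

Type and Owner can then be read from the returned Document with getElementsByTagName. Note that the payload starts right after </Header>, so it may begin with a line break as in the sample file, and that a very large file would call for scanning a stream rather than loading everything into memory.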

Upvotes: 2
