dunce1

Reputation: 323

What to do when a huge XML document is not well formed (Java)

I am using a Java SAX parser to parse XML data sent from a third-party source; the file is around 3 GB. Parsing fails because the document is not well formed, with the error: The processing instruction target matching "[xX][mM][lL]" is not allowed.

As far as I understand, this is normally due to a character being somewhere it should not be.

Main problem: Cannot manually edit these files due to their very large size.

Is there a workaround for files that are too large to open and edit manually, and is there a way to remove the problematic characters automatically in code?

Upvotes: 1

Views: 806

Answers (2)

Michael Kay

Reputation: 163322

I would think the most likely explanation is that the file contains a concatenation of several XML documents, or perhaps an embedded XML document: either way, an XML declaration that isn't at the start of the file.

A lot now depends on your relationship with the supplier of the bad data. If they sent you faulty equipment or buggy software, you would presumably complain and ask them to fix it. But if you don't have a service relationship with the third party, you either have to change supplier or do the best you can with the faulty input, which means repairing the fault yourself. In general, you can't repair faulty XML unless you know what kind of fault you are looking for, and that can be very difficult to determine if the files are large (or if the failures are very rare).

The data isn't XML, so don't try to use XML tools to process it. Use text processing tools such as sed or awk. The first step is to search the file for occurrences of <?xml and see if that gives any hints.
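Since the question is about Java, the same search can also be done there without loading the file into memory. The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes an ASCII-compatible encoding such as UTF-8 (it would not find anything in a UTF-16 file). Note that it also reports legitimate processing instructions whose target merely starts with xml, such as <?xml-stylesheet, so the reported offsets still need a manual look.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
public class XmlDeclScanner {
    private static final byte[] PATTERN = "<?xml".getBytes(StandardCharsets.US_ASCII);

    /** Returns the byte offsets at which the sequence "<?xml" starts in the stream. */
    public static List<Long> findOffsets(InputStream in) throws IOException {
        List<Long> offsets = new ArrayList<>();
        int matched = 0; // number of pattern bytes matched so far
        long pos = 0;    // offset of the byte currently being examined
        int b;
        while ((b = in.read()) != -1) {
            if (b == PATTERN[matched]) {
                matched++;
                if (matched == PATTERN.length) {
                    offsets.add(pos - PATTERN.length + 1);
                    matched = 0;
                }
            } else {
                // '<' occurs only at the start of the pattern, so a mismatch
                // can at most begin a new match with the current byte.
                matched = (b == PATTERN[0]) ? 1 : 0;
            }
            pos++;
        }
        return offsets;
    }
}
```

For a 3 GB file, wrap the FileInputStream in a BufferedInputStream before passing it in, because the sketch reads one byte at a time. An offset of 0 is the expected declaration; any other offset points at a candidate for the fault.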

Upvotes: 3

Moritz Petersen

Reputation: 13057

This error occurs if the XML declaration is anywhere but the beginning of the document. The reason might be:

  1. Whitespace before the XML declaration
  2. Any hidden character before the XML declaration
  3. The XML declaration appears somewhere else in the document

You should start by checking case #2; see here: http://www.w3.org/International/questions/qa-byte-order-mark#remove

If that doesn't help, you should remove leading whitespace from the document. You could do that by wrapping the original InputStream in another InputStream that strips the whitespace.

The same can be done if you are facing case #3, but the implementation would be a bit more complex.
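For cases #1 and #2, the stream wrapping could look roughly like the sketch below. It is an illustration under assumptions, not a tested fix for the file in question: the class and method names are hypothetical, it only recognises a UTF-8 byte order mark and ASCII whitespace, and it relies on InputStream.readNBytes, which needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper, not part of any library.
public class LeadingJunkSkipper {

    /**
     * Returns a stream positioned at the first byte that is neither part of
     * a UTF-8 byte order mark nor ASCII whitespace.
     */
    public static InputStream skipLeadingJunk(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);

        // Peek at the first three bytes and drop them only if they are EF BB BF.
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3);
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            in.unread(head, 0, n);
        }

        // Consume whitespace, then push back the first non-whitespace byte.
        int b;
        while ((b = in.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                in.unread(b);
                break;
            }
        }
        return in;
    }
}
```

The returned stream can then be handed to the SAX parser in place of the original one, for example wrapped in an org.xml.sax.InputSource. Nothing is buffered beyond the three pushback bytes, so the size of the file does not matter.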

Upvotes: 0
