Removing nodes with invalid tag names from an XML document

Question

I transform XML with the Saxon XSLT 2.0 processor (using Java and the Saxon S9API) and have to deal with source documents that contain invalid characters in tag names and therefore cannot be parsed by the document builder.

Example:

Code:

    import net.sf.saxon.s9api.*;
    [...]
    /* XSLT Processor & Compiler */
    proc = new Processor(false);
    /* build document from input */
    XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));

Error:

Error on line X column Y SXXP0003: Error reported by XML parser: Element type "E" must be followed by either attribute specifications, ">" or "/>".

An exclamation mark, and a tag name consisting only of a space, are currently the only invalid tags I encounter. I am looking for something more robust than simply removing whole lines of the (formatted) XML.

With some mind-bending I could come up with a regular expression to identify the invalid strings, but I would struggle with removing the nodes together with their attributes and child nodes.

Thank you for your help!

Michael Kay · Accepted Answer

If the input contains invalid tags, then it is not XML. It's best to get your mindset right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes": there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML.) To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at the source: all the benefits of XML are lost if people generate data in private non-XML formats.
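
As a rough illustration of the text-level repair step, here is a minimal sketch in Java rather than Perl, since the question already uses Java. It does not try to delete anything at the text level. It only rewrites invalid tag names to a placeholder so that the result can be parsed, after which the unwanted elements are real nodes and can be dropped in the stylesheet. Everything specific here is an assumption, not something taken from the question: the class name `TagSanitizer`, the placeholder name `invalid-tag`, the sample input shapes (`<E! ...>` and `< ...>`), the ASCII-only notion of a valid name, and the absence of CDATA sections or an internal DTD subset containing `<`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: renames invalid tag names so the text becomes parseable. */
final class TagSanitizer {

    /** Assumed placeholder; pick any name that cannot clash with real elements. */
    static final String PLACEHOLDER = "invalid-tag";

    // Start of a start or end tag: "<", an optional "/", then everything up to
    // whitespace, ">" or "/". Comments, CDATA, DOCTYPE and processing
    // instructions are skipped by the negative lookahead.
    private static final Pattern TAG_START =
            Pattern.compile("<(?!!--|!\\[CDATA\\[|!DOCTYPE|\\?)(/?)([^\\s>/]*)");

    // Simplified, ASCII-only approximation of an XML name (assumption:
    // the real data does not use non-ASCII element names).
    private static final Pattern VALID_NAME =
            Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.\\-]*");

    /** Returns the text with every invalid or empty tag name replaced by PLACEHOLDER. */
    static String sanitize(String text) {
        Matcher m = TAG_START.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2);
            String replacement = VALID_NAME.matcher(name).matches()
                    ? m.group()                          // valid name: keep as is
                    : "<" + m.group(1) + PLACEHOLDER;    // invalid or empty name
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The cleaned string could then be handed to Saxon via `new StreamSource(new StringReader(clean))`, and an empty template rule such as `<xsl:template match="invalid-tag"/>` would drop those elements together with their attributes and children. Whether this is enough depends on what the broken input really looks like, so treat the patterns as a starting point.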
