Christian Tausch
Christian Tausch

Reputation: 3

Removing nodes with invalid tag names from a xml document

I transform xml with the Saxon XSLT2 processor (using Java + the Saxon S9API) and have to deal with xml-documents as the source, which contain invalid characters as tag names and therefore can't be parsed by the document-builder.

Example:

<A>
   <B />
   <C>
      <D />
   </C>
   <E!_RANDOM_ />
   < />
</A>

Code:

import net.sf.saxon.s9api.*;

[...]

/* XSLT Processor & Compiler */
proc = new Processor(false);

/* build document from input*/
XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));

Error:

Error on line X column Y 
SXXP0003: Error reported by XML parser: Element type
"E" must be followed by either attribute specifications, ">" or "/>".

The exclamation mark and the tag name just being space are currently my only invalid tags. I am searching for a more robust solution rather than just removing whole lines of the (formated) xml.

With some mind-bending I could come up with a regular expression to identify the invalid strings, but would struggle with the removal of the nodes containing attributes and child-nodes.

Thank you for your help!

Upvotes: 0

Views: 608

Answers (1)

Michael Kay
Michael Kay

Reputation: 163539

If the input contains invalid tags then it is not XML. It's best to get your mind-set right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes" - there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML). To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at source: all the benefits of XML are lost if people generate data in private non-XML formats.

Upvotes: 2

Related Questions