TransformerFactory corrupts <input> and <br> tags inside an <html> tag

With simple code that parses and re-serializes simple XML, something strange happens with this

INPUT:

<html>
<input></input>
</html>

gives OUTPUT (not well-formed):

<html>
<input>
</html>

The same thing occurs with <input/> or <br/>.

It doesn't occur inside <html2>, or with other tags.

The code is the standard one:

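import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// READ XML
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
builderFactory.setNamespaceAware(true);
DocumentBuilder builder = builderFactory.newDocumentBuilder();

// PARSE
Document document = builder.parse(new InputSource(new StringReader(_xml_source)));

// WRITE XML
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
StringWriter buffer = new StringWriter();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.transform(new DOMSource(document), new StreamResult(buffer));
String output = buffer.toString();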

Is it a known bug?

Upvotes: 1

Views: 684

Answers (1)

Andreas

Reputation: 159215

XSLT defines an output method, which can be xml, html, or text.

The specification says that the default output method is html if the document element is <html> (in any letter case, with no namespace); otherwise it is xml.

With the xml method, you will get <input/>.

With the html method, you will get <input>, because the HTML specification says that empty elements such as INPUT and BR never have end tags.

You can explicitly give the output method, if you want:

transformer.setOutputProperty(OutputKeys.METHOD, "xml");

With that setting, a document with an <html> root element is serialized as XML, i.e. <input/>.
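To see the difference side by side, here is a minimal sketch that wraps the question's parse-and-write code in a helper and runs it with different output methods. The class name OutputMethodDemo and the helper roundTrip are made up for illustration; everything else is the standard JAXP API already used in the question. Exact whitespace in the output may vary between transformer implementations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class OutputMethodDemo {

    // Hypothetical helper: parses xmlSource and serializes it again.
    // If method is null, the transformer chooses its default output method.
    public static String roundTrip(String xmlSource, String method) throws Exception {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        builderFactory.setNamespaceAware(true);
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xmlSource)));

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        if (method != null) {
            // Explicit method overrides the default chosen from the root element.
            transformer.setOutputProperty(OutputKeys.METHOD, method);
        }
        StringWriter buffer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(buffer));
        return buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        String source = "<html><input></input></html>";
        System.out.println("default: " + roundTrip(source, null));
        System.out.println("xml:     " + roundTrip(source, "xml"));
        System.out.println("html:    " + roundTrip(source, "html"));
    }
}
```

If the output has to be parsed as XML again later, setting the method explicitly is the safer choice, since the result then no longer depends on the name of the root element.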

Quotes

XSLT output method:

The default for the method attribute is chosen as follows. If

  • the root node of the result tree has an element child,
  • the expanded-name of the first element child of the root node (i.e. the document element) of the result tree has local part html (in any combination of upper and lower case) and a null namespace URI, and
  • any text nodes preceding the first element child of the root node of the result tree contain only whitespace characters,

then the default output method is html; otherwise, the default output method is xml. The default output method should be used if there are no xsl:output elements or if none of the xsl:output elements specifies a value for the method attribute.
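The three conditions above can be sketched as a small check over a DOM tree. This is only one reading of the quoted rule, not the code any particular transformer uses; the class DefaultMethodSketch and its methods are hypothetical names, and the sketch assumes a namespace-aware parse so that getLocalName() is populated.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DefaultMethodSketch {

    // Hypothetical helper: applies the quoted conditions to the children
    // of a result-tree root (a DOM Document or DocumentFragment).
    public static String defaultOutputMethod(Node root) {
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                // Text before the first element must be whitespace only.
                // trim() is a slight simplification of XML whitespace.
                if (!child.getNodeValue().trim().isEmpty()) {
                    return "xml";
                }
            } else if (type == Node.ELEMENT_NODE) {
                // First element child: local part "html" in any letter
                // case, and a null namespace URI.
                boolean isHtml = "html".equalsIgnoreCase(child.getLocalName())
                        && child.getNamespaceURI() == null;
                return isHtml ? "html" : "xml";
            }
            // Comments, processing instructions and doctype are skipped.
        }
        return "xml"; // no element child at all
    }

    // Namespace-aware parse, as in the question's code.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(defaultOutputMethod(parse("<html><input/></html>")));
        System.out.println(defaultOutputMethod(parse("<html2><input/></html2>")));
    }
}
```

This also shows why <html2> in the question is unaffected: its local name is not html, so the default stays xml. An <html> element in the XHTML namespace would likewise stay xml, because its namespace URI is not null.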

HTML empty tags:

Some HTML element types have no content. For example, the line break element BR has no content; its only role is to terminate a line of text. Such empty elements never have end tags. The document type definition and the text of the specification indicate whether an element type is empty (has no content) or, if it can have content, what is considered legal content.

Upvotes: 3
