Work with raw text in javax.xml.transform.Transformer

Question

While working with an XML document, I use strings that already contain XML entities and wish them to be inserted as-is. However, this happens instead:

String s = "This &mdash; That";
....
document.appendChild(document.createTextNode(s));
....
transformer.transform(new DOMSource(document), new StreamResult(stringWriter));

System.out.println(stringWriter.toString()); // outputs "This &amp;mdash; That" at the relevant Node.

I have no control over the input string and I need exactly the output "This &mdash; That".

If I use StringEscapeUtils.unescapeHtml, the output is "This — That" which is not what I need.

I also tried several versions of transformer.setOutputProperty(OutputKeys.ENCODING, "encoding") but haven't found an encoding that converts "—" to "&mdash;".

What can I do to prevent javax.xml.transform.Transformer from re-escaping already correctly escaped text or how can I transform the input to get entities in the output?

Please explain how this is a duplicate.

The question referenced had the problem that numeric character references for carriage return and line feed were being converted into CRLF because the entities were being resolved. The solution there was to escape the entities.

My problem is the reverse. The text is already escaped and the transformer is re-escaping it: "&mdash;" is output as "&amp;mdash;".

I cannot use the solution of post-converting every "&amp;" to "&" because not all nodes represent HTML.

More complete code:

TransformerFactory factory = TransformerFactory.newInstance();
Transformer t = factory.newTransformer();
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.newDocument();
Element rootElement = document.createElement("Test");
// The already-escaped input string goes in as a plain text node.
rootElement.appendChild(document.createTextNode("This &mdash; That"));
document.appendChild(rootElement);

// The DocumentType is only used to supply the DOCTYPE output properties.
DOMImplementation domImpl = document.getImplementation();
DocumentType docType = domImpl.createDocumentType("Test",
                "-//Company//program//language",
                "test.dtd");
t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, docType.getPublicId());
t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, docType.getSystemId());
StringWriter writer = new StringWriter();
StreamResult rslt = new StreamResult(writer);
Source src = new DOMSource(document);
t.transform(src, rslt);
System.out.println(writer.toString());

// outputs xml header, then "This &amp;mdash; That"

ivan_pozdeev · Accepted Answer

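The fact is, once you have a DOM tree, there's no longer a string with an entity such as &mdash; in it: text is instead represented internally as a plain Unicode string. The "&" passed to createTextNode is therefore just a character, and the serializer escapes it on output.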

So, to input the raw string, you need to parse it to a Node, and to output, serialize a Node.
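The point about the DOM holding plain characters can be checked with a small sketch. It assumes the JDK's default JAXP implementation; the class name TextNodeEscapingDemo and its helper are hypothetical and only serve the illustration. The first call stores the literal characters "&mdash;" in a text node and shows them being escaped. The second stores the real character U+2014 and requests US-ASCII output, which should make the serializer fall back to a numeric character reference rather than a named entity.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of the original question or answer.
final class TextNodeEscapingDemo {

    // Builds <Test>text</Test> and serializes it with the given output encoding.
    static String serialize(String text, String encoding) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = document.createElement("Test");
            root.appendChild(document.createTextNode(text));
            document.appendChild(root);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            StringWriter writer = new StringWriter();
            t.transform(new DOMSource(document), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // '&' is an ordinary character inside the text node, so it is escaped.
        System.out.println(serialize("This &mdash; That", "UTF-8"));
        // U+2014 is not representable in US-ASCII, so a character
        // reference is written in its place.
        System.out.println(serialize("This \u2014 That", "US-ASCII"));
    }
}
```

Neither call yields the named entity: the Transformer has no output property that maps a character back to a name like mdash.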

Regarding serialization, there are a few other questions, including "Change the com.sun.org.apache.xml.internal.serialize.XMLSerializer & com.sun.org.apache.xml.internal.serialize.OutputFormat".
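As a rough sketch of serializing a Node without the internal com.sun classes, the standard DOM Level 3 Load and Save API can be used instead. This is a swapped-in alternative, not the approach from the linked question, and it assumes the Document's DOMImplementation also implements DOMImplementationLS (true for the JDK's bundled parser, but an assumption elsewhere). The class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical demo class; not part of the original question or answer.
final class NodeSerializerDemo {

    // Builds <Test>text</Test> and serializes the element with an LSSerializer.
    static String serializeElementWithText(String text) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = document.createElement("Test");
            root.appendChild(document.createTextNode(text));
            document.appendChild(root);

            // Assumption: the implementation supports the "LS" feature.
            DOMImplementationLS ls =
                    (DOMImplementationLS) document.getImplementation();
            LSSerializer serializer = ls.createLSSerializer();
            // Leave out the <?xml ...?> declaration for readability.
            serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
            return serializer.writeToString(root);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeElementWithText("This &mdash; That"));
    }
}
```

The text node is still escaped here, because the serializer sees the same plain characters that the Transformer does. Changing the serializer does not help unless the input is parsed first.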

To parse a single node, there is LSParser.parseWithContext.
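The DOM Load and Save specification allows an LSParser to raise NOT_SUPPORTED_ERR from parseWithContext, so it may not be available with every parser. A commonly used substitute, sketched here under that assumption, is to parse the escaped string as a tiny document and import its children into the target tree. Since mdash is not one of XML's predefined entities, the sketch declares it in an internal DTD subset. The wrapper element name, the class name, and the single-entity prolog are all hypothetical choices for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original question or answer.
final class FragmentParserDemo {

    // Declares the one named entity this sketch needs; a real document
    // would need a declaration for every entity that can occur.
    private static final String PROLOG =
            "<!DOCTYPE wrapper [<!ENTITY mdash \"&#8212;\">]>";

    // Parses the escaped string and appends the resulting nodes to target.
    static void appendParsed(Element target, String escaped) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String xml = PROLOG + "<wrapper>" + escaped + "</wrapper>";
            Document fragment =
                    builder.parse(new InputSource(new StringReader(xml)));
            Document owner = target.getOwnerDocument();
            for (Node child = fragment.getDocumentElement().getFirstChild();
                    child != null; child = child.getNextSibling()) {
                // importNode copies the node into the target document.
                target.appendChild(owner.importNode(child, true));
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Convenience wrapper: returns the text that ends up in the DOM.
    static String parseToText(String escaped) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = document.createElement("Test");
            document.appendChild(root);
            appendParsed(root, escaped);
            return root.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The DOM now holds the character U+2014, not the entity name.
        System.out.println(parseToText("This &mdash; That"));
    }
}
```

After parsing, the tree holds the real character rather than the entity name, so output encoded as UTF-8 contains the dash itself, and an ASCII encoding would be expected to produce a numeric reference. Any markup in the input string would also be parsed as elements, which may or may not be wanted.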
