wsams

Reputation: 2637

HTML character entities stripped when building org.w3c.dom.Document

I have a Java XML utility class. The buildDocument() method accepts an XML string and returns an org.w3c.dom.Document. The particular XML I'm passing to it is an XHTML 1.1 document.

The issue is that HTML named entities are dropped. For example, given

<p>Preserve dagger &dagger;</p>

the output is,

<p>Preserve dagger </p>

It does preserve the predefined XML entities &lt;, &gt;, &amp;, and &quot;.

Here is the class that creates the Document.

package com.example;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public final class XMLUtils {

    private XMLUtils() {
    }

    public static Document buildDocument(String xml) throws ParserConfigurationException, SAXException, IOException {

        DocumentBuilderFactory domFactory = DocumentBuilderFactory
            .newInstance();
        domFactory.setNamespaceAware(true);

        // Do not validate, and do not load the DTD grammar or the external DTD.
        domFactory.setFeature("http://xml.org/sax/features/validation", false);
        domFactory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        domFactory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        domFactory.setCoalescing(false);
        DocumentBuilder builder = domFactory.newDocumentBuilder();

        Document doc = builder.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));

        // Debug only: serialize the parsed document to see what the parser kept.
        try {
            DOMSource domSource = new DOMSource(doc);
            StringWriter writer = new StringWriter();
            StreamResult result = new StreamResult(writer);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            transformer.transform(domSource, result);
            System.out.println("XML OUT: \n" + writer.toString());
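        } catch (Exception ex) {
            // Report the failure instead of silently swallowing it.
            System.err.println("Could not serialize document: " + ex);
        }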

        return doc;
    }
}

I think these are the relevant dependencies.

<dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>Saxon-HE</artifactId>
    <version>9.5.1-6</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
    <type>jar</type>
</dependency>
<dependency>
    <groupId>xml-resolver</groupId>
    <artifactId>xml-resolver</artifactId>
    <version>1.2</version>
    <type>jar</type>
</dependency>

Any ideas on how to preserve these entities? Thanks, /w

Upvotes: 1

Views: 1161

Answers (1)

It took me some time to find a solution to this problem; apparently it is hard to hit on the right search keywords. Since I found this question before finding the best answer, I thought it was worth linking that answer here, even though it is also on Stack Overflow: Keep numeric character entity characters such as `&#10; &#13;` when parsing XML in Java

It is not quite satisfactory, but at least it explains very well why there is no better solution.
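
For reference, one workaround along those lines can be sketched as follows. XML predefines only &lt;, &gt;, &amp;, &quot; and &apos;; any other named entity has to be declared in a DTD, and the code in the question turns off loading of the external DTD, so the parser presumably has no definition for &dagger;. The sketch hides such references from the parser by escaping their ampersand before parsing and puts them back after serialization. The class EntityGuard and its methods protect() and restore() are hypothetical names of mine, not part of XMLUtils or of any library, and the regular expressions are a simplification: they do not skip CDATA sections or comments, and restore() cannot tell a protected reference from text that was already written as &amp;dagger; in the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (an assumption for illustration, not part of the original XMLUtils).
final class EntityGuard {

    // Matches named references such as &dagger; but not numeric ones such as &#8224;.
    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    private EntityGuard() {
    }

    // Escapes the ampersand of every named entity except the five predefined XML
    // entities, so the parser reads "&amp;dagger;" as the literal text "&dagger;".
    static String protect(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            boolean predefined = name.equals("lt") || name.equals("gt")
                    || name.equals("amp") || name.equals("quot") || name.equals("apos");
            String replacement = predefined ? m.group() : "&amp;" + name + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Reverses protect() on serialized output: "&amp;dagger;" becomes "&dagger;".
    // Escaped predefined entities such as "&amp;lt;" are left untouched.
    static String restore(String serialized) {
        return serialized.replaceAll(
                "&amp;(?!(?:lt|gt|amp|quot|apos);)([A-Za-z][A-Za-z0-9]*);", "&$1;");
    }
}
```

With this in place, buildDocument() would be called as buildDocument(EntityGuard.protect(xml)), and EntityGuard.restore() would be applied to the string produced by the Transformer. Inside the Document the text node then holds the characters "&dagger;" rather than an entity reference, which is the unsatisfactory part.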

Upvotes: 0
