HTML character entities stripped when building org.w3c.dom.Document

Question

I have a Java XML utility class. Its buildDocument() method accepts an XML string and returns an org.w3c.dom.Document. The particular XML I'm passing to it is an XHTML 1.1 document.

The issue is that HTML named entities are dropped. For example, given this input,

Preserve dagger &dagger;

the output is,

Preserve dagger

It does preserve the predefined XML entities &lt;, &gt;, &amp; and &quot;.

Here is the class that creates the Document.

package com.example;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public final class XMLUtils {

    private XMLUtils() {
    }

    public static Document buildDocument(String xml) throws ParserConfigurationException, SAXException, IOException {

        DocumentBuilderFactory domFactory = DocumentBuilderFactory
            .newInstance();
        domFactory.setNamespaceAware(true);

        domFactory.setFeature("http://xml.org/sax/features/validation", false);
        domFactory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        domFactory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        domFactory.setCoalescing(false);
        DocumentBuilder builder = domFactory.newDocumentBuilder();

        Document doc = builder.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)));

        try {
            DOMSource domSource = new DOMSource(doc);
            StringWriter writer = new StringWriter();
            StreamResult result = new StreamResult(writer);
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer transformer = tf.newTransformer();
            transformer.transform(domSource, result);
            System.out.println("XML OUT: \n" + writer.toString());
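        } catch (Exception ex) {
            // The serialization above is debug output only, so report the failure and carry on.
            ex.printStackTrace();
        }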

        return doc;
    }
}

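I think these are the relevant dependencies.

<dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>Saxon-HE</artifactId>
    <version>9.5.1-6</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.11.0</version>
    <type>jar</type>
</dependency>
<dependency>
    <groupId>xml-resolver</groupId>
    <artifactId>xml-resolver</artifactId>
    <version>1.2</version>
    <type>jar</type>
</dependency>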

Any ideas on how to preserve these entities? Thanks, /w
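
For context, a minimal sketch of one possible direction, not a confirmed fix for the setup above. XML itself predefines only lt, gt, amp, quot and apos; names such as dagger are declared in the XHTML DTD, which the class above tells the parser not to load. The sketch assumes the entity can instead be declared in the document's internal DTD subset. The class name EntityDemo and the helper rootText are hypothetical, and it runs against the JDK's default parser rather than the Xerces 2.11.0 dependency listed above, so behaviour there may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public final class EntityDemo {

    private EntityDemo() {
    }

    // Hypothetical helper: parses the XML string and returns the text
    // content of the root element, with entity references expanded.
    public static String rootText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the input carries its own declaration of the entity
        // in the internal subset, so no external DTD has to be fetched.
        String xml = "<!DOCTYPE p [ <!ENTITY dagger \"&#8224;\"> ]>"
                + "<p>Preserve dagger &dagger;</p>";
        System.out.println(rootText(xml));
    }
}
```

Because the parser now knows what dagger means, the reference is resolved to the character U+2020 in the resulting Document instead of being dropped. For a full XHTML document the same idea would need every entity in use to be declared, or the DTD to be made available locally.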
