dokaspar

Reputation: 8624

How to make XML Parser aware of all Character Entity References?

I get arbitrary XML from a server and parse it using this Java code:

String xmlStr; // arbitrary XML input
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); 
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource is = new InputSource(new StringReader(xmlStr));
    return builder.parse(is);
}
catch (SAXException | IOException | ParserConfigurationException e) {
    LOGGER.error("Failed to parse XML.", e);
}

Every once in a while, the XML input contains an undeclared entity reference such as &nbsp;, and parsing fails with an error such as org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.

I could solve this problem by preprocessing the original xmlStr and translating all problematic entity references before parsing. Here's a dummy implementation that works:

protected static String translateEntityReferences(String xml) {
    String newXml = xml;
    Map<String, String> entityRefs = new HashMap<>();
    entityRefs.put("&nbsp;", "&#160;");
    entityRefs.put("&laquo;", "&#171;");
    entityRefs.put("&raquo;", "&#187;");
    // ... and 250 more...
    for(Entry<String, String> er : entityRefs.entrySet()) {
        newXml = newXml.replace(er.getKey(), er.getValue());
    }
    return newXml;
}

However, this is really unsatisfactory, because there are a huge number of entity references, and I don't want to hard-code them all into my Java class.

Is there any easy way of teaching this entire list of character entity references to the DocumentBuilder?
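
For reference, the preprocessing idea can also be done in a single regex pass rather than one replace() call per entity. This is only a sketch: the class name EntityTranslator and the three-entry table are placeholders, and it assumes Java 9+ for Map.of and Matcher.replaceAll with a function argument. Names that are not in the table, including the predefined amp, lt, gt, quot and apos, are left as they are.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityTranslator {
    // Hypothetical lookup table; only the three entries from the question are shown.
    private static final Map<String, String> ENTITY_REFS = Map.of(
            "nbsp", "&#160;",
            "laquo", "&#171;",
            "raquo", "&#187;");

    // Matches a named entity reference such as &nbsp; and captures the name.
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String translateEntityReferences(String xml) {
        Matcher m = ENTITY.matcher(xml);
        // Known names become numeric references; anything else is kept verbatim.
        return m.replaceAll(r -> Matcher.quoteReplacement(
                ENTITY_REFS.getOrDefault(r.group(1), r.group())));
    }

    public static void main(String[] args) {
        System.out.println(translateEntityReferences("<doc>&laquo;a&nbsp;&amp;&nbsp;b&raquo;</doc>"));
    }
}
```

Like the loop-based version, this rewrites text blindly, so a matching string inside a CDATA section or comment would be changed too.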

Upvotes: 2

Views: 2323

Answers (1)

Jim Garrison

Reputation: 86774

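If you can change your code to work with StAX instead of DOM, a simple solution is to set the XMLInputFactory property IS_REPLACING_ENTITY_REFERENCES to false. The reader then reports each entity reference as an ENTITY_REFERENCE event instead of trying to expand it.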

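import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public static void main(String[] args) throws Exception
{
    String doc = "<doc>&nbsp;</doc>";
    ByteArrayInputStream is = new ByteArrayInputStream(doc.getBytes(StandardCharsets.UTF_8));

    XMLInputFactory xif = XMLInputFactory.newFactory();
    // Report entity references as events instead of expanding them
    xif.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
    XMLStreamReader xr = xif.createXMLStreamReader(is);

    // The reader starts on START_DOCUMENT. hasNext() turns false once
    // END_DOCUMENT is reached, so that last event is never printed.
    while(xr.hasNext())
    {
        int t = xr.getEventType();
        switch(t) {
            case XMLEvent.ENTITY_REFERENCE:
                // getLocalName() is the entity name; getText() is the
                // replacement text, which is null for an undeclared entity
                System.out.println("Entity: " + xr.getLocalName() + " " + xr.getText());
                break;
            case XMLEvent.START_DOCUMENT:
                System.out.println("Start Document");
                break;
            case XMLEvent.START_ELEMENT:
                System.out.println("Start Element: " + xr.getLocalName());
                break;
            case XMLEvent.END_DOCUMENT:
                System.out.println("End Document");
                break;
            case XMLEvent.END_ELEMENT:
                System.out.println("End Element: " + xr.getLocalName());
                break;
            default:
                System.out.println("Other:  ");
                break;
        }
        xr.next();
    }
}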

Output:

Start Document
Start Element: doc
Entity: nbsp null
End Element: doc

But that may require too much rewriting of your code if you really need the full DOM tree in memory.

I spent an hour tracing through the DOM implementation and couldn't find any way to make the DOM parser read from an XMLStreamReader.

Also, there is evidence in the code that the internal DOM parser implementation has an option similar to IS_REPLACING_ENTITY_REFERENCES, but I couldn't find any way to set it from the outside.
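
If the DOM tree is still needed, one possible workaround is to skip the DOM parser and build the Document by hand from the StAX events, turning each ENTITY_REFERENCE into a text node. The following is a rough sketch under several assumptions, not a complete solution: the class name StaxToDom and the three-entry entity table are made up for illustration, it relies on the JDK's built-in StAX reader behaving as in the example above, and it ignores namespaces, comments, processing instructions and entity references inside attribute values.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class StaxToDom {
    // Hypothetical entity table (name -> replacement text); extend as needed.
    private static final Map<String, String> ENTITIES = Map.of(
            "nbsp", "\u00A0",
            "laquo", "\u00AB",
            "raquo", "\u00BB");

    public static Document parse(String xml) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        xif.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLStreamReader xr = xif.createXMLStreamReader(new StringReader(xml));

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Node current = doc; // node that receives the next child
        while (xr.hasNext()) {
            switch (xr.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    Element e = doc.createElement(qualifiedName(xr.getPrefix(), xr.getLocalName()));
                    for (int i = 0; i < xr.getAttributeCount(); i++) {
                        e.setAttribute(
                                qualifiedName(xr.getAttributePrefix(i), xr.getAttributeLocalName(i)),
                                xr.getAttributeValue(i));
                    }
                    current.appendChild(e);
                    current = e;
                    break;
                }
                case XMLStreamConstants.END_ELEMENT:
                    current = current.getParentNode();
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    // A Document node cannot hold text, so skip anything outside the root.
                    if (current != doc) {
                        current.appendChild(doc.createTextNode(xr.getText()));
                    }
                    break;
                case XMLStreamConstants.ENTITY_REFERENCE: {
                    String name = xr.getLocalName();
                    // Unknown names are kept as literal "&name;" text.
                    String text = ENTITIES.getOrDefault(name, "&" + name + ";");
                    current.appendChild(doc.createTextNode(text));
                    break;
                }
                default:
                    break; // comments, processing instructions, DTD: ignored in this sketch
            }
        }
        xr.close();
        return doc;
    }

    private static String qualifiedName(String prefix, String localName) {
        return (prefix == null || prefix.isEmpty()) ? localName : prefix + ":" + localName;
    }

    public static void main(String[] args) throws Exception {
        Document d = parse("<doc>&nbsp;</doc>");
        Element root = d.getDocumentElement();
        System.out.println(root.getTagName() + " has text of length " + root.getTextContent().length());
    }
}
```

The entity table still has to come from somewhere, but it only needs the names and their replacement characters, and the rest of the code can keep working on an org.w3c.dom.Document.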

Upvotes: 1
