How to make XML Parser aware of all Character Entity References?

Question

I get arbitrary XML from a server and parse it using this Java code:

String xmlStr; // arbitrary XML input
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); 
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource is = new InputSource(new StringReader(xmlStr));
    return builder.parse(is);
}
catch (SAXException | IOException | ParserConfigurationException e) {
    LOGGER.error("Failed to parse XML.", e);
}

Every once in a while, the XML input contains an undeclared entity reference such as &nbsp;, and parsing fails with an error like org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.

I could solve this problem by preprocessing the original xmlStr and translating all problematic entity references before parsing. Here's a dummy implementation that works:

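// Needs: import java.util.HashMap; import java.util.Map; import java.util.Map.Entry;
protected static String translateEntityReferences(String xml) {
    String newXml = xml;
    // Maps each named entity reference to an equivalent numeric character
    // reference, which XML parsers accept without any declaration.
    Map<String, String> entityRefs = new HashMap<>();
    entityRefs.put("&nbsp;", "&#160;");
    entityRefs.put("&laquo;", "&#171;");
    entityRefs.put("&raquo;", "&#187;");
    // ... and 250 more...
    for (Entry<String, String> er : entityRefs.entrySet()) {
        newXml = newXml.replace(er.getKey(), er.getValue());
    }
    return newXml;
}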

However, this is really unsatisfactory, because there is a huge number of entity references, and I don't want to hard-code them all into my Java class.

Is there any easy way of teaching this entire list of character entity references to the DocumentBuilder?
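
For illustration, one direction that might avoid the string replacement is to prepend an internal DTD subset that declares the entities, so that the DocumentBuilder expands them itself. The following is only a sketch under assumptions: the class name EntityAwareParser, the helper withEntityDeclarations, and the placeholder root name "doc" are hypothetical; only three entities are declared; and the input is assumed to carry no DOCTYPE of its own and, at most, a leading XML declaration. The full list of declarations could presumably be generated from the entity sets the W3C publishes for XHTML rather than typed by hand.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityAwareParser {

    // Hypothetical internal DTD subset; only three entities are declared here.
    // The root name "doc" is a placeholder: a non-validating parser does not
    // check it against the actual root element.
    private static final String DOCTYPE =
            "<!DOCTYPE doc [\n"
          + "<!ENTITY nbsp \"&#160;\">\n"
          + "<!ENTITY laquo \"&#171;\">\n"
          + "<!ENTITY raquo \"&#187;\">\n"
          + "]>\n";

    // Inserts the DOCTYPE after a leading XML declaration if there is one,
    // otherwise at the very start. Assumes the input has no DOCTYPE already.
    static String withEntityDeclarations(String xmlStr) {
        int insertAt = 0;
        if (xmlStr.startsWith("<?xml")) {
            insertAt = xmlStr.indexOf("?>") + 2;
        }
        return xmlStr.substring(0, insertAt) + DOCTYPE + xmlStr.substring(insertAt);
    }

    // Parses the input with the entity declarations prepended.
    static Document parse(String xmlStr) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        String patched = withEntityDeclarations(xmlStr);
        return builder.parse(new InputSource(new StringReader(patched)));
    }
}
```

With this in place, EntityAwareParser.parse(xmlStr) would stand in for the builder.parse(is) call above. Because the input is arbitrary and DTD processing stays enabled, restricting external entity resolution on the factory may be worth considering as well.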
