Reputation: 8624
I get arbitrary XML from a server and parse it using this Java code:
String xmlStr; // arbitrary XML input
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource is = new InputSource(new StringReader(xmlStr));
    return builder.parse(is);
}
catch (SAXException | IOException | ParserConfigurationException e) {
    LOGGER.error("Failed to parse XML.", e);
}
Every once in a while, the XML input contains an undeclared entity reference such as &nbsp;
and parsing fails with an error such as org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.
I could solve this problem by preprocessing the original xmlStr
and translating all problematic entity references before parsing. Here's a dummy implementation that works:
protected static String translateEntityReferences(String xml) {
    String newXml = xml;
    // Maps each entity reference to the character it stands for.
    Map<String, String> entityRefs = new HashMap<>();
    entityRefs.put("&nbsp;", "\u00A0");
    entityRefs.put("&laquo;", "\u00AB");
    entityRefs.put("&raquo;", "\u00BB");
    // ... and 250 more...
    for (Entry<String, String> er : entityRefs.entrySet()) {
        newXml = newXml.replace(er.getKey(), er.getValue());
    }
    return newXml;
}
However, this is really unsatisfactory, because there are a huge number of entity references, and I don't want to hard-code them all into my Java class.
Is there any easy way of teaching this entire list of character entity references to the DocumentBuilder?
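For reference, the preprocessing idea above can also be done in a single pass with a regular expression, so that each reference is looked up once instead of running one replace() per entity. This is only a sketch: the class name EntityTranslatorSketch and its three-entry table are hypothetical, and like the dummy implementation it does not know about CDATA sections or comments. Names that are not in the table, including the predefined XML entities such as &amp;, are left untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class EntityTranslatorSketch {
    // Matches named references such as &nbsp; but not numeric ones such as &#160;.
    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Assumed sample table; a real one would hold the full entity list.
    private static final Map<String, String> ENTITIES = Map.of(
            "nbsp", "\u00A0",
            "laquo", "\u00AB",
            "raquo", "\u00BB");

    static String translate(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            // Unknown names (and &amp;, &lt;, ...) are copied through unchanged.
            String text = (replacement != null) ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("<doc>a&nbsp;b &amp; &laquo;c&raquo;</doc>"));
    }
}
```

This still needs the table to come from somewhere, so it does not remove the hard-coding problem; it only makes the replacement step cheaper and safer.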
Upvotes: 2
Views: 2323
Reputation: 86774
If you can change your code to work with StAX instead of DOM, the trivial solution is to set the XMLInputFactory property IS_REPLACING_ENTITY_REFERENCES to false.
public static void main(String[] args) throws Exception
{
    // The document references &nbsp; without declaring it.
    String doc = "<doc>&nbsp;</doc>";
    ByteArrayInputStream is = new ByteArrayInputStream(doc.getBytes());
    XMLInputFactory xif = XMLInputFactory.newFactory();
    // Report entity references as events instead of trying to expand them.
    xif.setProperty(javax.xml.stream.XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
    XMLStreamReader xr = xif.createXMLStreamReader(is);
    while(xr.hasNext())
    {
        int t = xr.getEventType();
        switch(t) {
            case XMLEvent.ENTITY_REFERENCE:
                System.out.println("Entity: "+ xr.getLocalName());
                break;
            case XMLEvent.START_DOCUMENT:
                System.out.println("Start Document");
                break;
            case XMLEvent.START_ELEMENT:
                System.out.println("Start Element: " + xr.getLocalName());
                break;
            case XMLEvent.END_DOCUMENT:
                System.out.println("End Document");
                break;
            case XMLEvent.END_ELEMENT:
                System.out.println("End Element: " + xr.getLocalName());
                break;
            default:
                System.out.println("Other: ");
                break;
        }
        xr.next();
    }
}
Output:
Start Document
Start Element: doc
Entity: nbsp null
End Element: doc
But that may require too much of a rewrite if you really need the full DOM tree in memory.
I spent an hour tracing through the DOM implementation and couldn't find any way to make the DOM parser read from an XMLStreamReader.
There is also evidence in the code that the internal DOM parser implementation has an option similar to IS_REPLACING_ENTITY_REFERENCES, but I couldn't find any way to set it from the outside.
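If the DOM parser has to stay, one workaround worth trying is to declare the missing entities yourself by prepending a DOCTYPE with an internal subset before parsing. The sketch below is an assumption-laden illustration, not something tested against the asker's input: the class EntityDoctypeSketch, its methods, and the three-entry table are all hypothetical names. It assumes the incoming XML has no DOCTYPE of its own, has no BOM or whitespace before an optional XML declaration, and that the parser configuration still allows DTDs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, for illustration only.
final class EntityDoctypeSketch {
    // Assumed sample table: entity name -> replacement written as a character reference.
    static final Map<String, String> SAMPLE_ENTITIES = new LinkedHashMap<>();
    static {
        SAMPLE_ENTITIES.put("nbsp", "&#160;");
        SAMPLE_ENTITIES.put("laquo", "&#171;");
        SAMPLE_ENTITIES.put("raquo", "&#187;");
    }

    // Inserts <!DOCTYPE root [<!ENTITY ...>...]> after the XML declaration, if any.
    static String withEntityDeclarations(String xml, String rootName, Map<String, String> entities) {
        StringBuilder doctype = new StringBuilder("<!DOCTYPE ").append(rootName).append(" [");
        for (Map.Entry<String, String> e : entities.entrySet()) {
            doctype.append("<!ENTITY ").append(e.getKey())
                   .append(" \"").append(e.getValue()).append("\">");
        }
        doctype.append("]>");
        // A DOCTYPE must come after the XML declaration when one is present.
        int insertAt = xml.startsWith("<?xml ") ? xml.indexOf("?>") + 2 : 0;
        return xml.substring(0, insertAt) + doctype + xml.substring(insertAt);
    }

    // Parses with the ordinary DocumentBuilder after adding the declarations.
    static Document parseWithEntities(String xml, String rootName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String patched = withEntityDeclarations(xml, rootName, SAMPLE_ENTITIES);
            return builder.parse(new InputSource(new StringReader(patched)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Failed to parse XML.", e);
        }
    }

    public static void main(String[] args) {
        Document d = parseWithEntities("<doc>&laquo;a&nbsp;b&raquo;</doc>", "doc");
        System.out.println(d.getDocumentElement().getTextContent());
    }
}
```

The entity table still has to be supplied, but it becomes plain data (name and code point) that could be loaded from a file rather than compiled into the class. Since the input is arbitrary XML from a server, keep in mind that leaving DTD processing enabled has security implications, so any hardening applied to the factory needs to be compatible with an internal subset.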
Upvotes: 1