Reputation: 63
I am having an issue when I parse an XML document that has numeric character references (i.e. &#xA0;). The problem I am running into is that when the document is parsed, the & is replaced with &amp;, so my parsed document will contain &amp;#xA0;. How do I stop this from happening? I have tried using factory.setExpandEntityReferences(false), but that doesn't seem to change anything.
Here is my code for parsing the document:
public static Document getXmlDoc(File xmlFile) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setIgnoringElementContentWhitespace(true);
    factory.setExpandEntityReferences(false);
    DocumentBuilder builder = factory.newDocumentBuilder();
    return builder.parse(xmlFile);
}
Any help would be greatly appreciated.
EDIT:
The XML that is parsed by the above code is modified and then written back to a file. The code to do this is below:
public static File saveXmlDoc(Document xmlDocument, String outputToDir, String outputFilename) throws IOException, TransformerException {
    String outputDir = outputToDir;
    // make sure the directory path ends with a separator and exists
    if (!outputDir.endsWith(File.separator)) outputDir += File.separator;
    if (!new File(outputDir).exists()) new File(outputDir).mkdir();
    File xmlFile = new File(outputDir + outputFilename);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    StreamResult saveResult = new StreamResult(outputDir + outputFilename);
    DOMSource source = new DOMSource(xmlDocument);
    transformer.transform(source, saveResult);
    return xmlFile;
}
EDIT 2:
Fixed a typo in factory.setIgnoringElementContentWhitespace(true).
EDIT 3 - My Solution:
Since my reputation is too low to answer my own question, here is the solution I used to fix all of this.
Here are the functions I changed in order to resolve this issue:
To get the XML Document:
public static Document getXmlDoc(File xmlFile) throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setIgnoringElementContentWhitespace(true);
    factory.setExpandEntityReferences(false);
    DocumentBuilder builder = factory.newDocumentBuilder();
    return builder.parse(xmlFile);
}
To save the XML Document:
public static File saveXmlDoc(Document xmlDocument, String outputToDir, String outputFilename) throws Exception {
    readNodesForHexConversion(xmlDocument.getChildNodes());
    String xml = getXmlAsString(xmlDocument);
    // write the xml out to a file
    Exception writeError = null;
    File xmlFile = null;
    FileOutputStream fos = null;
    try {
        if (!new File(outputToDir).exists()) new File(outputToDir).mkdir();
        xmlFile = new File(outputToDir + outputFilename);
        if (!xmlFile.exists()) xmlFile.createNewFile();
        fos = new FileOutputStream(xmlFile);
        byte[] xmlBytes = xml.getBytes("UTF-8");
        fos.write(xmlBytes);
        fos.flush();
    } catch (Exception ex) {
        ex.printStackTrace();
        writeError = ex;
    } finally {
        if (fos != null) fos.close();
        if (writeError != null) throw writeError;
    }
    return xmlFile;
}
To convert the XML Document to String:
public static String getXmlAsString(Document xmlDocument) throws TransformerFactoryConfigurationError, TransformerException {
    DOMSource domSource = new DOMSource(xmlDocument);
    StringWriter writer = new StringWriter();
    StreamResult result = new StreamResult(writer);
    Transformer transformer;
    transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(domSource, result);
    return writer.toString();
}
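The readNodesForHexConversion helper is not shown here. As a separate and purely illustrative alternative (not the approach used above), a Transformer can be asked to write in US-ASCII, in which case characters outside ASCII cannot be written literally and come out as numeric character references instead. A minimal sketch, assuming the JDK's default Transformer implementation; the class name AsciiXmlWriter and the method toAsciiXml are made up for this example, and whether the reference comes out in decimal or hex form depends on the implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper class; the names are assumptions, not part of the code above.
class AsciiXmlWriter {

    // Serializes with US-ASCII as the output encoding, so characters outside
    // ASCII are emitted as numeric character references rather than raw bytes.
    static String toAsciiXml(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<foo>&#xA0;</foo>".getBytes(StandardCharsets.UTF_8);
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        // The non-breaking space is written as a character reference.
        System.out.println(toAsciiXml(document));
    }
}
```

The trade-off is that every non-ASCII character in the document is written as a reference, not only the ones that were references in the input, because the DOM does not remember how a character was originally spelled.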
Upvotes: 0
Views: 1752
Reputation: 1500953
I can't reproduce the problem at the moment. Here's a short but complete program which tries to:
import org.w3c.dom.*;
import java.io.*;
import javax.xml.*;
import javax.xml.parsers.*;

public class Test {
    public static void main (String[] args) throws Exception {
        byte[] xml = "<foo>&#xA0;</foo>".getBytes("UTF-8");
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringElementContentWhitespace(true);
        factory.setExpandEntityReferences(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new ByteArrayInputStream(xml));
        Element element = document.getDocumentElement();
        String text = element.getFirstChild().getNodeValue();
        System.out.println(text.length()); // Prints 1
        System.out.println((int) text.charAt(0)); // Prints 160
    }
}
Now it's not clear from the above how the XML would be written out again - and it would help if you'd show the code you're using to do that - but it's clear that the single-character value of the text node is not being read as an ampersand followed by "#xA0;" separately, as I believe your question describes it, so I'd be really surprised to see it written out as "&amp;#xA0;".
Can you write a similar short but complete program which does demonstrate the problem? Will continue to try to do so myself.
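For what it's worth, here is a rough sketch of how such a round-trip check could look, assuming the document is written back with a default identity Transformer. The class name RoundTripCheck and its helper methods are invented for illustration. It also parses an input in which the ampersand is already escaped in the source file, since that is one hypothetical way to end up with "&amp;#xA0;" in the output; I have no evidence that this is what is happening in your case.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical reproduction harness; class and method names are illustrative only.
class RoundTripCheck {

    // Parses the XML string with the same factory settings as in the question.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringElementContentWhitespace(true);
        factory.setExpandEntityReferences(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Serializes the document to a string with a default identity Transformer.
    static String serialize(Document document) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(writer));
        return writer.toString();
    }

    // Returns the text content of the root element after parsing.
    static String rootText(String xml) throws Exception {
        return parse(xml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String plain = "<foo>&#xA0;</foo>";       // a real character reference
        String doubled = "<foo>&amp;#xA0;</foo>"; // ampersand already escaped in the input

        System.out.println(rootText(plain).length());   // Prints 1
        System.out.println(rootText(doubled).length()); // Prints 6

        System.out.println(serialize(parse(plain)).contains("&amp;"));   // Prints false
        System.out.println(serialize(parse(doubled)).contains("&amp;")); // Prints true
    }
}
```

If the first case ever produced an escaped ampersand, that would demonstrate the problem; if only the second one does, the input file itself would be worth checking.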
Upvotes: 1