Reputation: 1547
I'm trying to parse and replace values in a large XML file, ~45MB each. The way I do this is:
private void replaceData(File xmlFile, File out)
{
    try {
        // Parse the whole file into an in-memory DOM tree
        DocumentBuilderFactory df = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = df.newDocumentBuilder();
        Document xmlDoc = db.parse(xmlFile);
        xmlDoc.getDocumentElement().normalize();

        Node allData = xmlDoc.getElementsByTagName("Data").item(0);
        Element ctrlData = getSubElement(allData, "ctrlData");
        NodeList subData = ctrlData.getElementsByTagName("SubData");
        int len = subData.getLength();

        // Replace the text of each SubData/info/dailyInfo/value node
        for (int logIndex = 0; logIndex < len; logIndex++) {
            Node log = subData.item(logIndex);
            Element info = getSubElement(log, "info");
            Element value = getSubElement(info, "dailyInfo");
            Node valueNode = value.getElementsByTagName("value").item(0);
            valueNode.setTextContent("blah");
        }

        // Serialize the modified DOM back out to the target file
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        DOMSource s = new DOMSource(xmlDoc);
        StreamResult r = new StreamResult(out);
        t.transform(s, r);
    } catch (TransformerException | ParserConfigurationException | SAXException | IOException e) {
        throw new RuntimeException(e);
    }
}

private static Element getSubElement(Node node, String elementName)
{
    return (Element) ((Element) node).getElementsByTagName(elementName).item(0);
}
I notice that the further along the for loop I get, the longer each iteration takes: for an average of 100k nodes the whole run takes over 2 hours, while if I break the file out by hand into chunks of ~1k nodes, each chunk takes ~10s. Is there something inefficient about the way this document is being parsed?
----EDIT----
Based on comments and answers to this, I switched over to using SAX and XMLStreamWriter. Reference/example here: http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
After moving to SAX, memory usage for the replaceData function no longer grows with the size of the XML file, and XML file processing time dropped to ~18 seconds on average.
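For anyone looking for a starting point, below is a minimal sketch of what such a streaming rewrite can look like, using the StAX event API (XMLEventReader/XMLEventWriter) rather than a raw SAX handler. The element names (dailyInfo, value) and the replacement text are assumptions taken from the code in the question and would need adjusting to the real schema:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class StreamingReplace {

    // Copies the document event by event, swapping the text of every <value>
    // element that sits inside a <dailyInfo> element. Memory stays roughly
    // constant because no full tree is ever built.
    static void replaceData(File in, File out) throws Exception {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        inFactory.setProperty(XMLInputFactory.IS_COALESCING, true); // one Characters event per text node
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();

        XMLEventReader reader = inFactory.createXMLEventReader(new FileInputStream(in));
        XMLEventWriter writer = outFactory.createXMLEventWriter(new FileOutputStream(out), "UTF-8");

        boolean inDailyInfo = false;
        boolean inTargetValue = false;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    if ("dailyInfo".equals(name)) inDailyInfo = true;
                    else if ("value".equals(name) && inDailyInfo) inTargetValue = true;
                } else if (event.isEndElement()) {
                    String name = event.asEndElement().getName().getLocalPart();
                    if ("dailyInfo".equals(name)) inDailyInfo = false;
                    else if ("value".equals(name)) inTargetValue = false;
                } else if (event.isCharacters() && inTargetValue) {
                    writer.add(events.createCharacters("blah")); // replacement text
                    continue;                                    // drop the original text event
                }
                writer.add(event);
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}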
Upvotes: 3
Views: 5842
Reputation: 163322
Why are you doing this in Java when XSLT is designed for the task?
45MB is a big file to hold in memory, but still viable. The tree models used by good XSLT processors such as Saxon are much more efficient (both in storage space and in search speed) than a general-purpose DOM (for example, because they are read-only). And XSLT has much more scope to optimize your code.
I can't reverse engineer your specification from your code, but I don't see anything in your description that is intrinsically non-linear. I don't see any reason why this should take more than 10 minutes or so in Saxon.
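For illustration, here is a minimal sketch of that approach driven from Java via JAXP: an identity transform that copies everything unchanged, plus one template that overrides just the target element. The match pattern and element names are assumptions based on the question's code, and Saxon is only picked up by the factory if it is on the classpath (otherwise the JDK's built-in XSLT 1.0 processor is used):

import java.io.File;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltReplace {

    // Identity transform plus one template overriding dailyInfo/value.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:template match='@*|node()'>"
      + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "  </xsl:template>"
      + "  <xsl:template match='dailyInfo/value'>"
      + "    <value>blah</value>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    public static void main(String[] args) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.transform(new StreamSource(new File("in.xml")), new StreamResult(new File("out.xml")));
    }
}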
Upvotes: 2
Reputation: 21391
As people have mentioned in the comments, loading the whole DOM into memory, especially for large XML files, can be very inefficient, so a better approach is to use a SAX parser, which consumes constant memory. The drawback is that you don't get the fluent API of having the whole DOM in memory, and visibility is quite limited if you want to perform complicated callback logic in nested nodes.
If all you are interested in doing is parsing particular nodes and node families, rather than the whole XML, there is a better solution that gives you the best of both worlds; it has been blogged about and open-sourced. It's basically a very light wrapper on top of a SAX parser: you register the XML elements you are interested in, and when the callback fires you have their corresponding partial DOM at your disposal to XPath.
This way you can keep memory usage constant (scaling to XML files of over 1GB, as documented in the blog above) while maintaining the fluency of XPath-ing the DOM of the XML elements you are interested in.
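The blogged library isn't named in this answer, so purely as an illustration of the general idea (the class and callback names below are made up for this sketch, not the library's API): a SAX handler can build a small DOM for each registered element only, hand the finished subtree to a callback to be queried with XPath, and throw it away afterwards, so the full document is never in memory at once.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PartialDomParser {

    public interface ElementCallback { void onElement(Element partialDom) throws Exception; }

    // Streams the file with SAX but materialises a small DOM for each
    // occurrence of targetElement and passes the finished subtree to the callback.
    public static void parse(File xml, String targetElement, ElementCallback callback) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        DefaultHandler handler = new DefaultHandler() {
            private Node current; // node we are currently appending to; null outside a target element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (current == null && !qName.equals(targetElement)) return; // not inside a target yet
                Element e = doc.createElement(qName);
                for (int i = 0; i < attrs.getLength(); i++) {
                    e.setAttribute(attrs.getQName(i), attrs.getValue(i));
                }
                if (current != null) current.appendChild(e);
                current = e;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current == null) return;
                Node parent = current.getParentNode();
                if (parent == null && qName.equals(targetElement)) {
                    try {
                        callback.onElement((Element) current); // hand the finished subtree to the caller
                    } catch (Exception ex) {
                        throw new RuntimeException(ex);
                    }
                }
                current = parent; // pop; becomes null again once the target element closes
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
    }

    public static void main(String[] args) throws Exception {
        // Usage: XPath over each SubData subtree without ever building the full document tree
        XPathFactory xpf = XPathFactory.newInstance();
        parse(new File("in.xml"), "SubData",
              e -> System.out.println(xpf.newXPath().evaluate("info/dailyInfo/value", e)));
    }
}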
Upvotes: 3