Reputation: 28566
I have a huge XML file. I need to parse it and get every <elem/> as a single String so that I can save each one to a database.
The method must have a low memory footprint, because the file may be huge (~500 MB). How can I do that? I'm looking for a usable example. A sample file and my not very good solution are below:
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<header>...</header>
<elem>
<a/><b/><c>...</c>
</elem>
<elem>
<a>...</a><b/><c>...</c>
</elem>
<elem>
<a>...</a>
</elem>
...
</doc>
After split:
{'<elem/>', '<elem/>', ...}
Now I'm using a SAX DefaultHandler like the one below, but I don't think it is a good solution:
class DataFileParser extends DefaultHandler {
    StringBuilder sb;        // text of the <elem> currently being collected
    boolean sElem = false;   // true while the parser is inside an <elem>
    ...
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (qName.equalsIgnoreCase("elem")) {
            sb = new StringBuilder();
            sElem = true;
        }
        if (sElem) {
            sb.append("<" + qName + ">"); // note: attributes are not copied
        }
        ...
    }
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (sElem) {
            sb.append("</" + qName + ">"); // close nested elements as well as <elem>
        }
        if (qName.equalsIgnoreCase("elem")) {
            sElem = false;
        }
        ...
    }
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (sElem) {
            sb.append(ch, start, length); // note: text is not re-escaped
        }
    }
    ...
}
Upvotes: 0
Views: 522
Reputation: 163322
If you don't want to write any low-level Java code, there are other solutions. For example with Saxon-EE the following streaming transformation will do the trick:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:saxon="http://saxon.sf.net/"
version="3.0">
<xsl:template name="main">
<xsl:for-each select="saxon:stream(doc('big.xml'))/*/elem">
<xsl:result-document href="out{position()}.xml">
<xsl:copy-of select="."/>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Upvotes: 1
Reputation: 24722
Using a SAX parser is, in fact, a good solution. You may want to consider writing each element to the database directly in endElement. However, if you need to write the whole thing at once (for example into a single CLOB), you will have to store it somewhere no matter which parser you use. You could put it in a temp file for that.
In any case, a SAX parser is the most efficient solution, since the memory footprint depends mostly on the amount of data you hold on to, not on the parser implementation.
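To make the "write it out in endElement" idea concrete, here is a minimal sketch, not a definitive implementation. The class name ElemSplitter, the split method and the sink callback are my own hypothetical names, and the sink stands in for whatever database insert you use. It assumes the default (non-namespace-aware) JAXP parser, and it rebuilds the markup from SAX events, so comments, processing instructions and the original formatting of empty tags are not preserved.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams the document and hands each <elem> to a sink.
class ElemSplitter extends DefaultHandler {
    private final Consumer<String> sink;          // assumption: e.g. a JDBC insert
    private final StringBuilder sb = new StringBuilder();
    private int depth;                            // 0 = outside any <elem>

    ElemSplitter(Consumer<String> sink) {
        this.sink = sink;
    }

    // Parses the stream with the default JAXP SAX parser.
    static void split(InputStream in, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new ElemSplitter(sink));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth == 0 && !qName.equals("elem")) {
            return;                               // ignore everything outside <elem>
        }
        depth++;
        sb.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            sb.append(' ').append(atts.getQName(i)).append("=\"")
              .append(escape(atts.getValue(i))).append('"');
        }
        sb.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) {
            return;
        }
        sb.append("</").append(qName).append('>');
        if (--depth == 0) {                       // the outermost <elem> just closed
            sink.accept(sb.toString());           // hand it over, then drop it
            sb.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            sb.append(escape(new String(ch, start, length)));
        }
    }

    // Re-escapes characters that the parser has already decoded.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

You would call it as ElemSplitter.split(inputStream, xml -> ...), doing the insert inside the lambda. Only one <elem> is held in memory at a time, so the footprint is bounded by the largest single element rather than by the file size.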
Upvotes: 1