marioosh

Reputation: 28566

How to split XML? Any examples?

I have a huge XML file. I need to parse it and get every <elem/> as a single String so I can save them into a database, using a method with a low memory footprint, because the file may be huge (~500 MB). How can I do that? I'm looking for a usable example. A sample document and my not very good solution are below:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <header>...</header>
  <elem>
     <a/><b/><c>...</c>
  </elem>
  <elem>
     <a>...</a><b/><c>...</c>
  </elem>
  <elem>
     <a>...</a>
  </elem>
  ...
</doc>

After split:

{'<elem/>', '<elem/>', ...}

Now I'm using a SAX DefaultHandler like the one below, but I don't think it is a good solution:

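import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DataFileParser extends DefaultHandler {

        StringBuilder sb;       // text of the <elem> currently being collected
        boolean sElem = false;  // true while inside an <elem>

        ...

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
                if (qName.equalsIgnoreCase("elem")) {
                    // start a fresh buffer for each <elem>
                    sb = new StringBuilder();
                    sElem = true;
                }
                if (sElem) {
                    // note: attributes are not copied here
                    sb.append("<" + qName + ">");
                }
                ...
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
                if (sElem) {
                    // close every nested element, not only <elem> itself
                    sb.append("</" + qName + ">");
                }
                if (qName.equalsIgnoreCase("elem")) {
                    // sb now holds one complete <elem>...</elem>
                    sElem = false;
                }
                ...
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
                if (sElem) {
                    // note: special characters such as & and < arrive unescaped
                    sb.append(ch, start, length);
                }
        }

        ...
}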

Upvotes: 0

Views: 522

Answers (2)

Michael Kay

Reputation: 163322

If you don't want to write any low-level Java code, there are other solutions. For example, with Saxon-EE the following streaming transformation will do the trick, writing each <elem> to its own output file:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:saxon="http://saxon.sf.net/" 
   version="3.0">

<xsl:template name="main">
  <xsl:for-each select="saxon:stream(doc('big.xml'))/*/elem">
    <xsl:result-document href="out{position()}.xml">
      <xsl:copy-of select="."/>
    </xsl:result-document>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

Upvotes: 1

Alex Gitelman

Reputation: 24722

Using a SAX parser is, in fact, a good solution. You may want to consider writing to the database directly in endElement. However, if you need to write the whole thing at once (for example, as a single CLOB), you will have to store it somewhere no matter which parser you use. You could put it in a temp file for that.

In any case, a SAX parser is the most efficient solution, since the memory footprint depends mostly on the amount of data you hold on to, not on the parser implementation.
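
As a rough illustration of handing each element off from endElement, here is a minimal sketch. The class name ElemSplitter, the split helper, and the Consumer<String> callback are all made up for this example; in real code the callback would presumably run your database insert. It assumes the default, non-namespace-aware parser, and it writes empty elements as <a></a> rather than <a/>.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library: streams a document and
// passes each serialized <elem> to a callback as soon as it is closed.
final class ElemSplitter extends DefaultHandler {
    private final Consumer<String> sink;           // assumed: e.g. a method doing a JDBC insert
    private final StringBuilder sb = new StringBuilder();
    private int depth = 0;                         // 0 means we are outside any <elem>

    ElemSplitter(Consumer<String> sink) {
        this.sink = sink;
    }

    // Convenience entry point: parse the stream with the JDK's default SAX parser.
    static void split(InputStream in, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new ElemSplitter(sink));
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth == 0 && !qName.equals("elem")) {
            return;                                // ignore everything outside <elem>
        }
        depth++;
        sb.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            sb.append(' ').append(atts.getQName(i)).append("=\"")
              .append(escape(atts.getValue(i))).append('"');
        }
        sb.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) {
            return;
        }
        sb.append("</").append(qName).append('>');
        if (--depth == 0) {
            sink.accept(sb.toString());            // one complete <elem>...</elem>
            sb.setLength(0);                       // reuse the buffer; only one elem is held at a time
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            sb.append(escape(new String(ch, start, length)));
        }
    }

    // Re-escape characters the parser has already decoded, so the output stays well-formed.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

You would call it with something like ElemSplitter.split(new FileInputStream("big.xml"), xml -> save(xml)), where save is whatever does your insert. Only one <elem> is buffered at a time, so memory use follows the size of the largest element rather than the size of the file.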

Upvotes: 1
