Carsten

Reputation: 4334

Java: How to split XML stream into small XML documents? XPath on streaming XML parser?

I need to read a large XML document from the network and split it up into smaller XML documents. In particular the stream I read from the network looks something like this:

<a> <b> ... </b> <b> ... </b> <b> ... </b> <b> ... </b> .... </a>

I need to break this up into chunks of

<a> <b> ... </b> </a>

(I actually only need the <b> ... </b> parts, as long as the namespace bindings declared higher up (e.g. in <a>) are moved to <b>, if that makes it easier.)

The file is too big for a DOM-style parser; it has to be done streaming. Is there an XML library that can do this?

[Edit]

I think what I'm ideally looking for is the ability to run XPath queries on an XML stream, where the stream parser only parses as far as necessary to return the next item in the result node set (with all its attributes and children). It doesn't have to be XPath, just something along those lines.

Thanks!

Upvotes: 4

Views: 8376

Answers (5)

innovimax

Reputation: 560

You can do this with the XProc language:

<?xml version="1.0" encoding="ISO-8859-1"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:load href="in/huge-document.xml"/>
  <p:for-each>
    <p:iteration-source select="/a/b"/>
    <p:wrap match="/b" wrapper="a"/>
    <p:store>
       <p:with-option name="href" select="concat('part', p:iteration-position(), '.xml')">
          <p:empty/>
       </p:with-option>
    </p:store>
  </p:for-each>
</p:declare-step>

You can also try QuiXProc (a streaming XProc implementation: http://code.google.com/p/quixproc/ ) to run this in a streaming fashion.

Upvotes: 1

Jason

Reputation: 2673

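Go old school:

// pstmtout (the input file), endStatementTag (the closing tag to split on)
// and service are defined in the surrounding code.
// This only works when each closing tag sits on a line of its own.
StringBuilder buffer = new StringBuilder(1024 * 50);
try (BufferedReader reader = new BufferedReader(new FileReader(pstmtout))) {
  String line;
  while ((line = reader.readLine()) != null) {
    buffer.append(line).append('\n');  // readLine() strips the line terminator
    if (line.trim().equalsIgnoreCase(endStatementTag)) {
      service.handle(buffer.toString());
      buffer.setLength(0);  // reset the buffer for the next chunk
    }
  }
}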

Upvotes: 1

vtd-xml-author

Reputation: 3377

As an XML splitter, VTD-XML is well suited to this task, and it is also more memory-efficient than DOM. The key method that simplifies the code is VTDNav's getElementFragment(). The Java code below splits input.xml into out0.xml and out1.xml, turning

<a> <b> text1 </b>  <b> text2 </b> </a>

into

<a> <b> text1</b> </a> 

and

<a> <b> text2</b> </a>

using XPath

/a/b

The code

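import java.io.*;
import com.ximpleware.*;

public class split {
    public static void main(String[] argv) throws Exception {
        VTDGen vg = new VTDGen();
        // Second argument turns on namespace awareness
        if (vg.parseFile("c:/split/input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/a/b");
            int k = 0;
            byte[] ba = vn.getXML().getBytes();
            // evalXPath() moves the cursor to the next match and returns -1 when done
            while (ap.evalXPath() != -1) {
                // try-with-resources closes each output file (the original leaked them)
                try (FileOutputStream fos = new FileOutputStream("c:/split/out" + k + ".xml")) {
                    fos.write("<a>".getBytes());
                    // Lower 32 bits: byte offset of the fragment; upper 32 bits: its length
                    long l = vn.getElementFragment();
                    fos.write(ba, (int) l, (int) (l >> 32));
                    fos.write("</a>".getBytes());
                }
                k++;
            }
        }
    }
}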

For further reading, please visit http://www.devx.com/xml/Article/36379

Upvotes: 1

Adam Batkin

Reputation: 53024

I happen to like the XOM XML library, as its interface is simple, intuitive and powerful. To do what you want with XML, you can use your own NodeFactory and (for example) override the finishMakingElement() method. If it is making the element that you want (in your case, <b>) then you pass it along to whatever you need to do with it.

Upvotes: 0

Jimmy

Reputation: 1433

The JAXP SAX API with a SAX filter is both fast and efficient. A good introduction to filters can be seen here
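
For the <a>/<b> layout in the question, such a filter might look roughly like the sketch below, which uses only JDK classes. The names SaxSplitter and split and the Consumer callback are made up for illustration and are not part of JAXP. It assumes the repeated element is a direct child of the root, copies the namespace bindings declared higher up onto each fragment, and does not handle a prefix being redeclared on the child itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: serializes every <childName> directly under the root
// and hands it to a callback as soon as its end tag has been read.
class SaxSplitter extends XMLFilterImpl {
    private final String childName;
    private final Consumer<String> sink;
    // Prefix -> URI bindings declared outside the current fragment
    private final Map<String, String> inScope = new LinkedHashMap<>();
    private int depth = 0;
    private TransformerHandler out;   // non-null only while inside a fragment
    private StringWriter buffer;

    SaxSplitter(XMLReader parent, String childName, Consumer<String> sink) {
        super(parent);
        this.childName = childName;
        this.sink = sink;
    }

    @Override public void startPrefixMapping(String prefix, String uri) throws SAXException {
        if (out != null) out.startPrefixMapping(prefix, uri);
        else inScope.put(prefix, uri);
    }

    @Override public void endPrefixMapping(String prefix) throws SAXException {
        if (out != null) out.endPrefixMapping(prefix);
        else inScope.remove(prefix);
    }

    @Override public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        depth++;
        if (out == null && depth == 2 && local.equals(childName)) {
            out = newHandler();
            out.startDocument();
            // Re-declare the bindings inherited from the ancestors on the fragment root
            for (Map.Entry<String, String> e : inScope.entrySet())
                out.startPrefixMapping(e.getKey(), e.getValue());
        }
        if (out != null) out.startElement(uri, local, qName, atts);
    }

    @Override public void endElement(String uri, String local, String qName) throws SAXException {
        if (out != null) {
            out.endElement(uri, local, qName);
            if (depth == 2) {             // the fragment root just closed
                for (String p : inScope.keySet()) out.endPrefixMapping(p);
                out.endDocument();
                sink.accept(buffer.toString());
                out = null;
            }
        }
        depth--;
    }

    @Override public void characters(char[] ch, int start, int len) throws SAXException {
        if (out != null) out.characters(ch, start, len);
    }

    // Fresh identity serializer per fragment, writing into a new StringWriter
    private TransformerHandler newHandler() throws SAXException {
        try {
            SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler h = tf.newTransformerHandler();
            h.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            buffer = new StringWriter();
            h.setResult(new StreamResult(buffer));
            return h;
        } catch (TransformerConfigurationException e) {
            throw new SAXException(e);
        }
    }

    // Demo convenience only: collects the fragments in a list
    static List<String> split(InputSource in, String childName) throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setNamespaceAware(true);
        List<String> parts = new ArrayList<>();
        new SaxSplitter(f.newSAXParser().getXMLReader(), childName, parts::add).parse(in);
        return parts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a xmlns:n='urn:x'><b>text1</b><b><n:c/></b></a>";
        for (String part : split(new InputSource(new StringReader(xml)), "b"))
            System.out.println(part);
    }
}
```

For a really large input you would pass your own callback to the constructor (write each fragment to a file, send it to a queue, and so on) instead of collecting everything in a list as split() does, so that only one <b> is held in memory at a time.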

Upvotes: 2
