David Soergel

Reputation: 1699

How to get a streaming Iterator[Node] from a large XML document?

I need to process XML documents that consist of a very large number of independent records, e.g.

<employees>
    <employee>
         <firstName>Kermit</firstName>
         <lastName>Frog</lastName>
         <role>Singer</role>
    </employee>
    <employee>
         <firstName>Oscar</firstName>
         <lastName>Grouch</lastName>
         <role>Garbageman</role>
    </employee>
    ...
</employees>

In some cases these are just big files, but in others they may come from a streaming source.

I can't just scala.xml.XmlLoader.load() it, because I don't want to hold the whole document in memory (or wait for the input stream to close) when I only need to work with one record at a time. I know I can use XmlEventReader to stream the input as a sequence of XmlEvents, but these are much less convenient to work with than scala.xml.Node.

So I'd like to get a lazy Iterator[Node] out of this somehow, in order to operate on each individual record using the convenient Scala syntax, while keeping memory usage under control.

To do this myself, I could start with an XmlEventReader, build up a buffer of events between each matching start and end tag, and then construct a Node tree from that. But is there an easier way that I've overlooked? Thanks for any insights!
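For concreteness, the do-it-yourself version I have in mind would look something like this (untested sketch; `buildNode` and `recordIterator` are just names I made up):

```scala
import scala.io.Source
import scala.xml._
import scala.xml.pull._

// Rebuild one Node from the events between a start tag and its matching
// end tag, recursing for nested elements. Comments, processing
// instructions, etc. are skipped for brevity.
def buildNode(events: Iterator[XMLEvent], pre: String, label: String,
              attrs: MetaData, scope: NamespaceBinding): Node = {
  val children = Seq.newBuilder[Node]
  var done = false
  while (!done && events.hasNext) events.next() match {
    case EvElemStart(p, l, a, s) => children += buildNode(events, p, l, a, s)
    case EvText(text)            => children += Text(text)
    case EvElemEnd(_, _)         => done = true
    case _                       => // skip other event types
  }
  Elem(pre, label, attrs, scope, children.result(): _*)
}

// Skip the root start tag, then lazily yield one rebuilt Node per
// second-level start tag (one per <employee>, in my example).
def recordIterator(source: Source): Iterator[Node] = {
  val events = new XMLEventReader(source)
  var rootSeen = false
  new Iterator[Node] {
    private var record: Option[Node] = None
    private def advance() {
      record = None
      while (record.isEmpty && events.hasNext) events.next() match {
        case EvElemStart(_, _, _, _) if !rootSeen => rootSeen = true
        case EvElemStart(p, l, a, s) =>
          record = Some(buildNode(events, p, l, a, s))
        case _ => // whitespace between records, and the closing root tag
      }
    }
    advance()
    def hasNext = record.isDefined
    def next() = {
      val n = record.getOrElse(throw new NoSuchElementException)
      advance()
      n
    }
  }
}
```

Note that XMLEventReader itself runs a producer thread under the hood, which is part of what I'd like to avoid.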

Upvotes: 9

Views: 1767

Answers (2)

David Soergel

Reputation: 1699

To get from huynhjl's generator solution to a TraversableOnce[Node], use this trick:

def generatorToTraversable[T](func: (T => Unit) => Unit) = 
  new Traversable[T] {
    def foreach[X](f: T => X) {
      func(f(_))
    }
  }

def firstLevelNodes(input: Source): TraversableOnce[Node] =
  generatorToTraversable(processSource(input))
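To see the trick in isolation, here it is applied to a toy callback-style generator (the hypothetical `emitInts` stands in for `processSource`):

```scala
// generatorToTraversable turns any (T => Unit) => Unit producer into a
// Traversable[T] whose foreach simply hands the callback to the producer.
def generatorToTraversable[T](func: (T => Unit) => Unit) =
  new Traversable[T] {
    def foreach[X](f: T => X) { func(f(_)) }
  }

// A toy producer that "streams" three values through a callback.
def emitInts(callback: Int => Unit) { (1 to 3) foreach callback }

val once = generatorToTraversable(emitInts)
println(once.map(_ * 10).toList) // List(10, 20, 30)
```

Unlike a Source-backed generator, this toy one can safely be traversed more than once, since the range is regenerated on each foreach; with processSource, the underlying Source is exhausted after one pass.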

The result of generatorToTraversable is not traversable more than once (even though a new ConstructingParser is instantiated on each foreach call) because the input stream is a Source, which is an Iterator. We can't override Traversable.isTraversableAgain, though, because it's final.

Really we'd like to enforce this by just returning an Iterator. However, both Traversable.toIterator and Traversable.view.toIterator make an intermediate Stream, which will cache all the entries (defeating the whole purpose of this exercise). Oh well; I'll just let the stream throw an exception if it's accessed twice.

Also note the whole thing isn't thread safe.

This code runs great, and I believe the overall solution to be both lazy and not caching (hence constant memory), though I haven't tried it on a large input yet.

Upvotes: 5

huynhjl

Reputation: 41646

You can use the parser underlying XMLEventReader, ConstructingParser, and process your employee nodes below the top level with a callback. You just have to be careful to discard the data as soon as it's processed:

import scala.xml._

def processSource[T](input: Source)(f: NodeSeq => T) {
  new scala.xml.parsing.ConstructingParser(input, false) {
    nextch // initialize per documentation
    document // trigger parsing by requesting document

    var depth = 0 // track depth

    override def elemStart(pos: Int, pre: String, label: String,
        attrs: MetaData, scope: NamespaceBinding) {
      super.elemStart(pos, pre, label, attrs, scope)
      depth += 1
    }
    override def elemEnd(pos: Int, pre: String, label: String) {
      depth -= 1
      super.elemEnd(pos, pre, label)
    }
    override def elem(pos: Int, pre: String, label: String, attrs: MetaData,
        pscope: NamespaceBinding, nodes: NodeSeq): NodeSeq = {
      val node = super.elem(pos, pre, label, attrs, pscope, nodes)
      depth match {
        case 1 => <dummy/> // dummy final roll up
        case 2 => f(node); NodeSeq.Empty // process and discard employee nodes
        case _ => node // roll up other nodes
      }
    }
  }
}

Then use it like this to process each node at the second level in constant memory (assuming no single second-level node has an unbounded number of children):

processSource(src){ node =>
  // process here
  println(node)
}

The benefit over XMLEventReader is that this doesn't use two threads, and, unlike your proposed solution, it doesn't parse each node twice. The drawback is that it relies on the inner workings of ConstructingParser.

Upvotes: 8
