John Allison

Reputation: 996

How can I process an InputStream of XML events through several handlers and into another InputStream in Java using StAX?

I need to parse an XML file using StAX (as the files are too big to keep them in memory) and transform certain events / elements in the process. I have an InputStream associated with the XML file at hand. My idea is to design several handlers, feed the input stream through all handlers and to another input stream (because the client of the transformation expects an input stream). I've come up with the following code so far:

An interface that all handlers implement:

public interface EventHandler {

  XMLEvent process(XMLEvent event);
}

Several handlers like the following. One such handler may for example add an attribute to an element, another handler may somehow transform an element's text, etc.

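import javax.xml.stream.events.XMLEvent;

public class Handler1 implements EventHandler {

  @Override
  public XMLEvent process(XMLEvent event) {
    if (supports(event)) {
      // processing logic omitted
    }
    // Hand the (possibly transformed) event on to the next handler.
    return event;
  }

  private boolean supports(XMLEvent event) {
    // condition to process an event omitted
    return false; // placeholder so that the class compiles
  }
}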

The processing class that gets an original input stream and returns a processed input stream:

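import java.io.InputStream;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import org.springframework.beans.factory.annotation.Autowired;

public class XmlProcessor {

  @Autowired
  private Set<EventHandler> eventHandlers;

  public InputStream process(InputStream inputStream) throws XMLStreamException {
    InputStream result = null; // open question: how to build this from the processed events
    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    // XMLEventReader yields XMLEvent objects; XMLStreamReader.next() only returns an int event code.
    XMLEventReader xmlReader = xmlInputFactory.createXMLEventReader(inputStream);
    while (xmlReader.hasNext()) {
      XMLEvent event = xmlReader.nextEvent();
      for (EventHandler eh : eventHandlers) {
        event = eh.process(event);
      }
      // here I need to somehow add the processed event to the resulting input stream
      // that I will return from this method.
    }
    return result;
  }
}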

I'm stuck trying to find a way to feed the processed event to the resulting InputStream. How do I do this? Do I need to run the while loop in another thread and use PipedInputStream and PipedOutputStream for this process, or can I achieve this in a single thread?

Upvotes: 0

Views: 563

Answers (1)

Michael Kay

Reputation: 163458

It's usually simpler, in my view, to implement a pipeline in push mode (where the supplier of data makes "send" calls to the recipient) rather than in pull mode (where the recipient makes "readNext" calls to the supplier). Push mode is particularly convenient when there isn't a one-to-one correspondence of events, for example when one event in the first filter step turns into multiple events in the next.
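To make the push-mode idea concrete, here is a minimal sketch using the standard javax.xml.stream.util.XMLEventConsumer interface, which XMLEventWriter already extends. The class names UpperCaseText, CommentBeforeItem and PushPipelineDemo, the element name "item" and the comment text are assumptions made up for illustration; they are not from the question. Note how the second stage emits two events for one incoming event without any extra bookkeeping.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.XMLEventConsumer;

/** Hypothetical stage: upper-cases character data (one event in, one event out). */
class UpperCaseText implements XMLEventConsumer {
  private final XMLEventFactory events = XMLEventFactory.newInstance();
  private final XMLEventConsumer next;

  UpperCaseText(XMLEventConsumer next) { this.next = next; }

  @Override
  public void add(XMLEvent event) throws XMLStreamException {
    if (event.isCharacters()) {
      String text = event.asCharacters().getData().toUpperCase(Locale.ROOT);
      next.add(events.createCharacters(text));
    } else {
      next.add(event);
    }
  }
}

/** Hypothetical stage: pushes an extra comment before each <item> start tag (one in, two out). */
class CommentBeforeItem implements XMLEventConsumer {
  private final XMLEventFactory events = XMLEventFactory.newInstance();
  private final XMLEventConsumer next;

  CommentBeforeItem(XMLEventConsumer next) { this.next = next; }

  @Override
  public void add(XMLEvent event) throws XMLStreamException {
    if (event.isStartElement()
        && "item".equals(event.asStartElement().getName().getLocalPart())) {
      next.add(events.createComment("checked"));
    }
    next.add(event);
  }
}

class PushPipelineDemo {
  /** Parses once, pushes each event through the stages, serializes once at the end. */
  static void transform(InputStream in, OutputStream out) throws XMLStreamException {
    XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
    XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
    // The XMLEventWriter is itself an XMLEventConsumer, so it terminates the chain.
    XMLEventConsumer pipeline = new UpperCaseText(new CommentBeforeItem(writer));
    while (reader.hasNext()) {
      pipeline.add(reader.nextEvent());
    }
    writer.close();
    reader.close();
  }

  public static void main(String[] args) throws XMLStreamException {
    byte[] xml = "<list><item>a</item><item>b</item></list>".getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    transform(new ByteArrayInputStream(xml), out);
    System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
  }
}
```

Only the outermost loop pulls from the parser; every stage after that is driven by add() calls, so a stage is free to emit zero, one or several events per call.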

However, whether you're using pull mode or push mode, the object passed between the stages in your pipeline should be a stream of XML events, not a stream of bytes. If you pass a stream of bytes, then each stage in the pipeline is going to have to do XML parsing and serialization, which makes it very inefficient. The only benefit of using a stream of bytes would be if the stages of the pipeline are running in different processes perhaps on different machines, so they do not share memory.
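As a rough illustration of passing events rather than bytes between stages, the sketch below keeps the EventHandler interface from the question and wraps the parser's XMLEventReader in a delegate that applies every handler to each event. The document is parsed once and serialized once, however many handlers there are. HandlerChainReader, EventPipelineDemo and the two sample handlers are hypothetical names introduced here, and a List is used instead of the question's Set so that the handler order is defined.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.EventReaderDelegate;

/** Same shape as the EventHandler interface in the question. */
interface EventHandler {
  XMLEvent process(XMLEvent event);
}

/** Hypothetical pull-mode stage: applies every handler to each event read from the parent. */
class HandlerChainReader extends EventReaderDelegate {
  private final List<EventHandler> handlers;

  HandlerChainReader(XMLEventReader parent, List<EventHandler> handlers) {
    super(parent);
    this.handlers = handlers;
  }

  // Only nextEvent() is overridden, so callers should read through nextEvent().
  @Override
  public XMLEvent nextEvent() throws XMLStreamException {
    XMLEvent event = super.nextEvent();
    for (EventHandler handler : handlers) {
      event = handler.process(event);
    }
    return event;
  }
}

class EventPipelineDemo {
  private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

  /** Sample handler: adds processed="true" to every start element. */
  static XMLEvent markProcessed(XMLEvent event) {
    if (!event.isStartElement()) {
      return event;
    }
    StartElement start = event.asStartElement();
    List<Attribute> attributes = new ArrayList<>();
    Iterator<?> existing = start.getAttributes();
    while (existing.hasNext()) {
      attributes.add((Attribute) existing.next());
    }
    attributes.add(EVENTS.createAttribute("processed", "true"));
    return EVENTS.createStartElement(start.getName(), attributes.iterator(), start.getNamespaces());
  }

  /** Sample handler: upper-cases character data. */
  static XMLEvent upperCase(XMLEvent event) {
    if (!event.isCharacters()) {
      return event;
    }
    return EVENTS.createCharacters(event.asCharacters().getData().toUpperCase(Locale.ROOT));
  }

  /** Bytes are parsed once on the way in and written once on the way out. */
  static String run(String xml) throws XMLStreamException {
    XMLEventReader parsed = XMLInputFactory.newInstance()
        .createXMLEventReader(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    List<EventHandler> handlers =
        Arrays.asList(EventPipelineDemo::markProcessed, EventPipelineDemo::upperCase);
    XMLEventReader processed = new HandlerChainReader(parsed, handlers);

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
    while (processed.hasNext()) {
      writer.add(processed.nextEvent());
    }
    writer.close();
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws XMLStreamException {
    System.out.println(run("<list><item>a</item></list>"));
  }
}
```

The ByteArrayOutputStream is only there to keep the demo short; for files too large for memory the writer would target a file or a pipe instead.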

You're right to observe that if a stage in the pipeline wants to pull from the previous stage and push to the next stage (that is, if it wants to own the control loop), then each stage is going to have to run in a separate thread, unless you're using a programming language that supports coroutines, such as C# or JavaScript with the "yield" construct.
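If the client really must receive an InputStream, the two-thread arrangement could look roughly like the sketch below: a worker thread owns the read/transform/write loop and pushes bytes into a PipedOutputStream, while the caller pulls from the connected PipedInputStream. PipedXmlProcessor, the thread name and the upper-casing transform are assumptions for illustration only, and the error handling is deliberately simplistic.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class PipedXmlProcessor {
  private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

  /** Stand-in for the handler chain: upper-cases character data. */
  static XMLEvent transform(XMLEvent event) {
    if (!event.isCharacters()) {
      return event;
    }
    return EVENTS.createCharacters(event.asCharacters().getData().toUpperCase(Locale.ROOT));
  }

  /** Returns a stream the caller pulls from while a worker thread pushes XML into it. */
  static InputStream process(InputStream source) throws IOException {
    PipedInputStream result = new PipedInputStream(8192);
    PipedOutputStream sink = new PipedOutputStream(result);

    Thread worker = new Thread(() -> {
      // Closing the pipe in all cases lets the reading side see end-of-stream.
      try (OutputStream out = sink) {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(source);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        while (reader.hasNext()) {
          writer.add(transform(reader.nextEvent()));
        }
        writer.close();
      } catch (IOException | XMLStreamException e) {
        // Simplification: the reader just sees a truncated stream. Real code would
        // need a way to report this failure to the consuming thread.
        e.printStackTrace();
      }
    }, "xml-pipeline-worker");
    worker.setDaemon(true);
    worker.start();
    return result;
  }

  /** Small helper that drains a stream into a string. */
  static String readAll(InputStream in) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[1024];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buffer.write(chunk, 0, n);
    }
    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    byte[] xml = "<list><item>a</item></list>".getBytes(StandardCharsets.UTF_8);
    try (InputStream processed = process(new ByteArrayInputStream(xml))) {
      System.out.println(readAll(processed));
    }
  }
}
```

The pipe's small fixed buffer is what keeps memory use bounded: the worker blocks when the buffer is full and resumes as the client reads. The returned stream has to be consumed on a different thread from the worker, which is the case here by construction.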

Another option is to write the transformation steps as XSLT 3.0 streaming transformations. Two implementations of XSLT 3.0 streaming have been produced: Saxon works internally in push mode, Exselt (not currently available) works internally in pull mode, but if your code is in XSLT then you're insulated from this because your own code is entirely declarative.

See also my paper "You pull, I'll push: on the polarity of pipelines", published at Balisage 2009 (http://www.balisage.net/Proceedings/vol3/html/Kay01/BalisageVol3-Kay01.html).

Upvotes: 1
