gavenkoa

Reputation: 48713

Java: processing large XML files - extracting data without coding state automata?

I am not experienced in Java XML processing. My colleague quickly built an implementation on the JAXP SAX parser, so the large XML is never loaded into memory and we operate on streams. That means we implemented a callback interface with methods like:

public void startElement(..., String elementName, ...){ ... }
public void characters(char [] buf, int offset, int len) { ... }

The implementation maintains the current position in the tag hierarchy with a stack of element names and a depth counter.

Each startElement/endElement is full of spaghetti if/switch statements and registers callbacks that are invoked from the characters method to decide whether, how, and where to extract and save the next partially processed portion of data. This code is riddled with filtering logic. The actual logic is larger, but not harder.

On each closing 2nd-level tag, if the filters make a positive decision, we pass the gathered data elsewhere, clean up the current context state, and begin processing the next portion of data.

Our logic is primitive: if a 2nd-level tag is person and has subtags in this order: skills / skill / id, with a specified value for id, then extract the 3rd-level email tag value plus the 4th-level tag value address / city.
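To make the question concrete, here is a minimal sketch of the kind of stack-based SAX handler described above. The tag names (person, skills, skill, id, email, address, city) and the filter value WANTED_SKILL_ID are assumptions taken from the example in this question, not a real schema:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PersonHandler extends DefaultHandler {
    static final String WANTED_SKILL_ID = "42"; // assumed filter value
    final Deque<String> path = new ArrayDeque<>();
    final StringBuilder text = new StringBuilder();
    final List<String> results = new ArrayList<>();
    String email, city;
    boolean skillMatched;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        path.push(qName);
        text.setLength(0); // only the innermost element's text matters here
    }

    @Override
    public void characters(char[] buf, int off, int len) {
        text.append(buf, off, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String value = text.toString().trim();
        if (endsWith("person", "skills", "skill", "id") && WANTED_SKILL_ID.equals(value)) {
            skillMatched = true;
        } else if (endsWith("person", "email")) {
            email = value;
        } else if (endsWith("person", "address", "city")) {
            city = value;
        } else if (endsWith("person")) {
            if (skillMatched) results.add(email + " / " + city); // hand off gathered data
            email = null; city = null; skillMatched = false;     // reset per-person state
        }
        path.pop();
        text.setLength(0);
    }

    // true if the current element path ends with the given tag sequence
    boolean endsWith(String... tail) {
        if (path.size() < tail.length) return false;
        Iterator<String> it = path.iterator(); // iterates from the top of the stack
        for (int i = tail.length - 1; i >= 0; i--)
            if (!tail[i].equals(it.next())) return false;
        return true;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<people>"
            + "<person><skills><skill><id>42</id></skill></skills>"
            + "<email>a@example.com</email>"
            + "<address><city>Kyiv</city></address></person>"
            + "<person><skills><skill><id>7</id></skill></skills>"
            + "<email>b@example.com</email>"
            + "<address><city>Lviv</city></address></person>"
            + "</people>";
        PersonHandler h = new PersonHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), h);
        System.out.println(h.results); // [a@example.com / Kyiv]
    }
}
```

Even for this toy rule, the state (path, skillMatched, per-person fields) is exactly the hand-rolled automaton I would like to avoid.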

This task isn't plain XPath, as we extract several categories at once, and if I understand correctly, XPath operates on a DOM and isn't stream-oriented.

I see a possible use of XSLT (which is a stream-oriented language), but it seems its scope is making one XML document from another. It would be possible to pipe the large document through an XSLT processor to build an easy-to-process XML with descriptive XSLT source code, and then process the result with a SAX parser. But this looks like a bad decision.
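For what it's worth, the pipe described above doesn't have to serialize the intermediate document: JAXP's Transformer can send its output straight into a SAX ContentHandler via SAXResult. A minimal sketch, where the stylesheet and tag names are made up for illustration (and note the caveat that the default JAXP XSLT processor still builds an in-memory tree of the input, so this does not remove the memory concern):

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.helpers.DefaultHandler;

public class XsltPipeDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical stylesheet that keeps only <email> elements
        String xsl = "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:template match='/'><emails>"
            + "<xsl:copy-of select='//email'/></emails></xsl:template>"
            + "</xsl:stylesheet>";
        String xml = "<people><person><email>a@example.com</email></person></people>";
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        // SAXResult feeds the transformed output directly into a SAX handler,
        // so the simplified document is never written out as text
        t.transform(new StreamSource(new StringReader(xml)),
                new SAXResult(new DefaultHandler() {
                    final StringBuilder sb = new StringBuilder();
                    @Override
                    public void characters(char[] buf, int off, int len) {
                        sb.append(buf, off, len);
                    }
                    @Override
                    public void endDocument() {
                        System.out.println(sb);
                    }
                }));
    }
}
```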

What Java technology is used for extracting data from a regular, structured, large XML stream using descriptive instructions (ideally in a reduced XPath-like syntax that defines tag ordering from the root and checks tag/attribute values) to decide when and what to extract, and that provides a callback extension point to pass each extracted portion of data on for further processing?

My main goal is to make the code more maintainable by expressing the extraction rules in a descriptive way and avoiding a toy custom finite state automaton to track where we are in the SAX parser.

Upvotes: 0

Views: 282

Answers (1)

stringy05

Reputation: 7067

SAX is old-skool and, as you point out, you end up with lots of logic inside your startElement callbacks.

StAX is the streaming pull parser that I think is much better suited to your use case: it lets you pull events from the XML stream, so there's no DOM-like requirement to load the entire document, and you get more support for XML semantics than with the SAX approach. StAX is described here.
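A minimal pull-parsing sketch with javax.xml.stream (the tag names are just assumptions borrowed from the question, not a real schema):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<people><person><email>a@example.com</email></person></people>";
        XMLStreamReader r = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        // The loop pulls events on demand instead of receiving callbacks,
        // so the extraction logic can live in ordinary control flow
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "email".equals(r.getLocalName())) {
                // getElementText() reads the text and consumes the END_ELEMENT
                System.out.println(r.getElementText());
            }
        }
        r.close();
    }
}
```

Because the reader is a cursor you advance yourself, nested rules can be written as nested loops or helper methods rather than as a flat state machine spread across callbacks.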

Upvotes: 1
