JayAgl

Reputation: 147

What is an alternative to using DOM XML parser for large XML Documents for multiple find operations?

I am storing data for ranking users in XML documents - one row per user - containing a 36 char key, score, rank, and username as attributes.

<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<!DOCTYPE Ranks [<!ELEMENT Rank ANY ><!ATTLIST Rank id ID #IMPLIED>]>
<Ranks>
..<Rank id="<userKey>" score="36.0" name="John Doe" rank="15"></Rank>..
</Ranks>

There are several such documents, each parsed on request using a DOM parser and kept in memory until the file is updated. This happens from within an HttpServlet which is backing a widget. Every time the widget is loaded it calls the servlet with a GET request, which then requires one of the documents to be queried. The queries on the documents require the following operations:

In my test environment the number of users is <100 and everything works well. However, we are soon supposed to be delivering to a system with 200K+ users, and I have serious concerns about the scalability of my approach - i.e. an OutOfMemoryError!

I'm stuck for ideas for an implementation that balances performance and memory usage. While DOM is good for find operations, it may choke because of the large document size. I don't know much about StAX, but from what I have read it seems that it might solve the memory issue, but could really slow down the queries, as I would effectively have to iterate through the document to find the element of interest (is that correct?).
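To make the StAX concern concrete, I imagine a lookup would have to look something like the sketch below (the file name and helper class are made up; every call scans from the top of the document until it hits the matching id):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Sketch: find one user's score by streaming through the document with StAX.
// Each lookup is O(number of Rank elements) because there is no index.
public final class StaxRankLookup {
    public static String findScore(String file, String userKey) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(file)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "Rank".equals(reader.getLocalName())
                            && userKey.equals(reader.getAttributeValue(null, "id"))) {
                        return reader.getAttributeValue(null, "score"); // stop at first match
                    }
                }
            } finally {
                reader.close();
            }
        }
        return null; // no such user
    }
}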

Questions:

Thanks

Edit: I am not allowed to use databases.

Edit: Would it be better/neater to use a custom-formatted file instead and use regular expressions to search the file for the required entry?
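For illustration, the kind of scan I have in mind would look something like this (the pipe-separated format and file name are just assumptions for the sketch; since the key is a fixed 36 characters at the start of each line, a prefix check may even be simpler than a regular expression):

import java.io.BufferedReader;
import java.io.FileReader;

// Sketch: one user per line in the form "key|score|rank|name".
public final class FlatFileLookup {
    public static String[] find(String file, String userKey) throws Exception {
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.startsWith(userKey + "|")) {     // fixed-width key prefix
                    return line.split("\\|", 4);          // [key, score, rank, name]
                }
            }
        }
        return null; // user not found
    }
}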

Upvotes: 1

Views: 3903

Answers (3)

vtd-xml-author

Reputation: 3377

For heavy-duty XML processing, VTD-XML is the most efficient option available - far more efficient than JDOM, dom4j, or DOM. The key is the non-object-oriented approach of its info-set modeling, which also makes it far less likely to cause an out-of-memory error. Read this 2013 paper for a comprehensive comparison/benchmark of the various XML frameworks:

Processing XML with Java – A Performance Benchmark
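As a rough sketch of a lookup (class names are from VTD-XML's com.ximpleware package; the file name and XPath are placeholders, not a definitive implementation):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

// Sketch: parse once with VTD-XML, then evaluate an XPath lookup.
// VTD keeps the document as a byte array plus integer tokens,
// so the per-element memory overhead is far lower than DOM's.
public class VtdRankLookup {
    public static String findScore(String file, String userKey) throws Exception {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile(file, false)) {   // false = namespace-unaware parse
            throw new IllegalStateException("parse failed: " + file);
        }
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("string(/Ranks/Rank[@id='" + userKey + "']/@score)");
        String score = ap.evalXPathToString();
        return score.isEmpty() ? null : score;
    }
}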

Upvotes: 0

Michael Kay

Reputation: 163342

One of the big problems here is that DOM is not thread-safe, so even read operations need to be synchronized. From that point of view, using JDOM or XOM would definitely be better.
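To illustrate the point (a minimal sketch with invented names): every read of the shared tree has to go through one lock, which serializes your servlet threads:

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch: serializing reads of a shared DOM tree. Because the DTD declares
// the id attribute as type ID, getElementById can resolve the user key directly.
public class SynchronizedRankStore {
    private final Object lock = new Object();
    private Document ranksDoc; // parsed once, replaced when the file changes

    public String getScore(String userKey) {
        synchronized (lock) { // even pure reads must hold the lock
            Element rank = ranksDoc.getElementById(userKey);
            return (rank == null) ? null : rank.getAttribute("score");
        }
    }
}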

The other issue is the search strategy used to find the data. You really want the queries to be supported by indexing rather than serial search. In fact, you need a decent query optimizer to generate efficient access paths. So given your constraint of not using a database, this sounds like a case for an in-memory XQuery engine with aggressive optimization, for which the obvious candidate is Saxon-EE. But then I would say that, wouldn't I?
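For the record, a minimal sketch using Saxon's s9api (assuming Saxon-EE on the classpath; the file name and key value are placeholders):

import java.io.File;
import net.sf.saxon.s9api.*;

// Sketch: build the document once, then run XQuery lookups against the
// in-memory tree; the EE optimizer can turn the predicate into an index lookup.
public class SaxonRankQuery {
    public static void main(String[] args) throws SaxonApiException {
        Processor proc = new Processor(true); // true = EE features (needs a license)
        XdmNode doc = proc.newDocumentBuilder().build(new File("ranks.xml"));

        XQueryExecutable exp = proc.newXQueryCompiler().compile(
            "declare variable $key external; /Ranks/Rank[@id = $key]");

        XQueryEvaluator qe = exp.load();
        qe.setContextItem(doc);
        qe.setExternalVariable(new QName("key"), new XdmAtomicValue("some-user-key"));

        for (XdmItem item : qe.evaluate()) {
            System.out.println(item); // the matching <Rank> element, if any
        }
    }
}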

Upvotes: 2

Ben Taitelbaum

Reputation: 7403

It sounds like you're using the XML document as a database. I think you'll be much happier using a proper database for this, importing/exporting to XML as needed. Several databases would work well, so you might as well use one that's well supported, like MySQL or PostgreSQL, although even SQLite will work better than XML.

In terms of SAX parsing, you basically build a large state machine that handles various events that occur while parsing (entering a tag, leaving a tag, seeing data, etc.). You're then on your own to manage memory (recording the data you see depending on the state you're in), so you're correct that it can have a better memory footprint, but running a query like that for every web request is ridiculous, especially when you can store all your data in a nice indexed database.
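As a sketch of that state machine (class and field names are made up): a single SAX pass can build a HashMap keyed by user id, so each web request becomes a constant-time lookup instead of a re-parse:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: one streaming pass over the document, recording only the
// attributes we care about instead of building a full tree.
public class RankIndexer extends DefaultHandler {
    private final Map<String, String[]> index = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("Rank".equals(qName)) {
            // store [score, rank, name] under the 36-char user key
            index.put(attrs.getValue("id"), new String[] {
                attrs.getValue("score"), attrs.getValue("rank"), attrs.getValue("name") });
        }
    }

    public static Map<String, String[]> buildIndex(File xml) throws Exception {
        RankIndexer handler = new RankIndexer();
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return handler.index;
    }
}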

Upvotes: 2
