Nate Uni

Reputation: 943

How to efficiently make a large XML file searchable in a web application?

I have an XML document that I need to make searchable via a webapp. The document is currently only 6 MB, but it could become extremely large, so from my research SAX seems the way to go.

So my question is, given a search term do I:

  1. Load the document into memory once (into a list of beans) and search that in-memory copy whenever a search is requested? Or

  2. Parse the document looking for the desired search term, add only the matches to the list of beans, and repeat this parse for each search?

I am not very experienced with webapps, but I am trying to figure out the optimal way to approach this. Does anyone with Tomcat, SAX and Java webapp experience have a suggestion as to which would be better?

Regards, Nate

Upvotes: 1

Views: 1426

Answers (3)

Serge Ballesta

Reputation: 149155

When you say that your XML file could be very large, I assume you do not want to keep it in memory. If you want it to be searchable, I understand that you want indexed access, without a full read each time. IMHO, the only way to achieve that is to parse the file once and load the data into a lightweight file database (Derby, HSQL or H2), adding the relevant indexes. Databases allow indexed searches over data that does not fit in memory; XML files do not.
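The parse-then-index approach can be sketched with embedded H2 (a minimal sketch, assuming the H2 driver is on the classpath; the `entry` table and its columns are made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class XmlToDbDemo {
        public static void main(String[] args) throws Exception {
            // In-memory H2 database; use a file URL such as jdbc:h2:./xmlindex
            // to keep the data across restarts.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:xmlindex")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE entry(id INT PRIMARY KEY, content VARCHAR(1024))");
                    // The index is what makes the search cheap on large data sets.
                    st.execute("CREATE INDEX idx_content ON entry(content)");
                }
                // In a real app these rows would come from a SAX/StAX pass over the XML file.
                try (PreparedStatement ins =
                         conn.prepareStatement("INSERT INTO entry VALUES (?, ?)")) {
                    ins.setInt(1, 1); ins.setString(2, "search text1"); ins.executeUpdate();
                    ins.setInt(1, 2); ins.setString(2, "search text2"); ins.executeUpdate();
                }
                try (PreparedStatement q =
                         conn.prepareStatement("SELECT id FROM entry WHERE content = ?")) {
                    q.setString(1, "search text1");
                    try (ResultSet rs = q.executeQuery()) {
                        while (rs.next()) {
                            System.out.println("match id: " + rs.getInt("id"));
                        }
                    }
                }
            }
        }
    }

The index means a later search only touches the relevant pages on disk, regardless of how large the original XML grows.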

Upvotes: 1

Michael Kay

Reputation: 163625

Searching the file using XPath or XQuery is likely to be very fast (quite fast enough unless you are talking thousands of transactions per second). What takes time is parsing the file - building a tree in memory so that XPath or XQuery can search it. So (as others have said) a lot depends on how frequently the contents of the file change. If changes are infrequent, you should be able to keep a copy of the file in shared memory, so the parsing cost is amortized over many searches. But if changes are frequent, things get more complicated. You could try keeping a copy of the raw XML on disk, and a copy of the parsed XML in memory, and keeping the two in sync. Or you could bite the bullet and move to using an XML database - the initial effort will pay off in the end.

Your comment that "SAX is the way to go" would only be true if you want to parse the file each time you search it. If you're doing that, then you want the fastest possible way to parse the file. But a much better way forward is to avoid parsing it afresh on each search.
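The parse-once-then-search pattern described above can be sketched with the JDK's built-in DOM and XPath APIs (a minimal sketch; the class and method names are made up, and in a webapp the parsed `Document` would typically live in the ServletContext rather than a local variable):

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XPathSearchDemo {

        // Find every element whose text content equals the search term.
        static NodeList search(Document doc, String term) throws Exception {
            return (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//*[text()='" + term + "']", doc, XPathConstants.NODESET);
        }

        public static void main(String[] args) throws Exception {
            // Parse ONCE; every later search reuses the in-memory tree,
            // so the expensive parsing cost is amortized over many searches.
            String xml = "<root><x><y>search text1</y><z>search text2</z></x></root>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList hits = search(doc, "search text1");
            System.out.println(hits.getLength() + " hit(s), first in <"
                    + hits.item(0).getNodeName() + ">");
        }
    }

Note the sketch builds the XPath string by concatenation for brevity; a real search box should use a compiled expression with a variable resolver to avoid injection.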

Upvotes: 0

BatScream

Reputation: 19700

Assuming your search field is known to you in advance, let the structure of the XML be, for example:

<a>....</a>
<x>
  <y>search text1</y>
  <z>search text2</z>
</x>
<b>...</b>

and say the search has to be made on 'x' and its children. You can achieve this using a StAX parser and JAXB.

To understand the difference between StAX and SAX, please refer to:

When should I choose SAX over StAX?

Using these APIs you avoid storing the entire document in memory. With the StAX parser you walk through the document, and when you encounter the 'x' tag you load just that element into memory (Java beans) using JAXB.

Note: only 'x' and its children are loaded into memory, not everything parsed so far. Do not use any approach based on a DOM parser.

Sample code that loads only the part of the document where the search field is present:

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stream.StreamSource;

XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("file");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);

// Skip forward until the reader is positioned on the <x> element.
xsr.nextTag();
while (!xsr.getLocalName().equals("x")) {
    xsr.nextTag();
}

// Unmarshal just the <x> subtree into a bean; the rest of the file is never loaded.
JAXBContext jc = JAXBContext.newInstance(X.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
JAXBElement<X> jb = unmarshaller.unmarshal(xsr, X.class);
xsr.close();

X x = jb.getValue();
System.out.println(x.y.content);

Now you have the field content to return the appropriate result. When the user searches again for the same field under 'x', serve the result from memory and avoid parsing the XML again.
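That serve-from-memory step can be sketched with a thread-safe cache keyed by search term (a minimal sketch; the class name and the `Function` standing in for the StAX/JAXB parse are made up for illustration):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.Function;

    public class SearchCache {
        // term -> result; ConcurrentHashMap is safe for concurrent servlet threads.
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final Function<String, String> expensiveLookup;

        public SearchCache(Function<String, String> expensiveLookup) {
            this.expensiveLookup = expensiveLookup;
        }

        public String search(String term) {
            // The expensive parse runs only on a cache miss;
            // repeat searches are answered from memory.
            return cache.computeIfAbsent(term, expensiveLookup);
        }

        public static void main(String[] args) {
            AtomicInteger parses = new AtomicInteger();
            SearchCache cache = new SearchCache(term -> {
                parses.incrementAndGet();   // stands in for the StAX/JAXB parse
                return "result for " + term;
            });
            cache.search("search text1");
            cache.search("search text1");   // second call never touches the file
            System.out.println("parses: " + parses.get());  // parses: 1
        }
    }

If the underlying XML file can change, the cache needs an invalidation step (e.g. clear it when the file's last-modified timestamp moves), otherwise it will serve stale results.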

Upvotes: 1
