Reputation: 30137
Suppose I have a very big XML file with entries that have <id> tags or id="" attributes.
How can I search by this id? Can I create some kind of search index?
Currently I am using org.w3c.dom. Does it have any means for searching?
UPDATE
My big XML file is a downloaded Wikipedia dump. It is 40 GB in size and has millions of records.
Is it possible to index it with something like Lucene and then search for IDs fast?
UPDATE2
I have tried BaseX. It ate my XML and created a database of 32 GB. I don't understand whether it truncated the data or whether the 32 GB is the result of some compression.
Unfortunately, searching by ID takes 70-80 seconds or longer, which is slower than a MediaWiki API query.
Upvotes: 2
Views: 2731
Reputation: 1309
To read or write an XML file, you first need to parse the data inside it. There are different types of parsers; the major ones are DOM, SAX, and StAX.
I wouldn't recommend a DOM parser, especially for a large XML file. A DOM parser reads the whole document into memory first and only then lets you read data from it, which is extremely inefficient if your XML files are really large. SAX and StAX are not tree builders but streaming parsers: they read the document sequentially and hand you one event at a time, so memory use stays small no matter how big the file is. Have a read on the StAX parser in Java here
I think StAX is the most suitable parser for reading a large XML file.
FYI, here is a link to the SAX parser too
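To make the StAX suggestion concrete, here is a minimal sketch of a streaming lookup by id. It assumes a Wikipedia-like layout in which each record is a <page> element with <title> and <id> children (or an id="" attribute on <page>); the class name StaxIdSearch, the method findTitleById, and those element names are illustrative assumptions, not something taken from your file.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxIdSearch {

    // Streams through the document and returns the <title> of the first <page>
    // whose direct <id> child or id="" attribute equals targetId, or null if
    // no page matches. Element names "page", "id", "title" are assumptions.
    public static String findTitleById(Reader in, String targetId) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            String title = null;
            boolean matched = false;
            int depth = 0;       // nesting depth of the current element
            int pageDepth = -1;  // depth of the open <page>, or -1 when outside one
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    String name = r.getLocalName();
                    if (pageDepth < 0 && name.equals("page")) {
                        pageDepth = depth;
                        title = null;
                        matched = targetId.equals(r.getAttributeValue(null, "id"));
                    } else if (pageDepth > 0 && depth == pageDepth + 1) {
                        // Only direct children of <page> count, so nested
                        // ids (e.g. a revision id) are ignored.
                        if (name.equals("id")) {
                            if (targetId.equals(r.getElementText().trim())) {
                                matched = true;
                            }
                            depth--; // getElementText() consumed the end tag
                        } else if (name.equals("title")) {
                            title = r.getElementText();
                            depth--; // getElementText() consumed the end tag
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == pageDepth) {
                        if (matched) {
                            return title;
                        }
                        pageDepth = -1;
                    }
                    depth--;
                }
            }
            return null;
        } finally {
            r.close();
        }
    }
}
```

Memory use stays constant because only one event is held at a time, but this is still a linear scan: on a file of that size a single lookup reads everything up to the match. If you need many lookups, the same single pass could instead feed each id into a separate index, and the queries would then go against that index rather than the XML.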
Upvotes: 3