Suzan Cioc

Reputation: 30137

How to do a fast/indexed search inside a very BIG XML file?

Suppose I have a very big XML file with entries that have <id> tags or id="" attributes.

How can I search by this id? Can I create some kind of search index?

Currently I am using org.w3c.dom. Does it have some means for searching?

UPDATE

My big XML file is a downloaded Wikipedia dump. It is 40 GB in size and has millions of records.

Is it possible to index it with something like Lucene and then search for IDs fast?

UPDATE2

I have tried BaseX. It ingested my XML and created a database of 32 GB. I don't understand whether it truncated the data or whether the 32 GB is the result of some compression.

Unfortunately, searching by ID takes 70-80 seconds or longer, which is slower than a MediaWiki API query.

Upvotes: 2

Views: 2731

Answers (1)

Jason

Reputation: 1309

In order to read or write an XML file, you need to parse the data inside it first. There are different types of parsers; the major ones are DOM, SAX, and StAX.

I wouldn't recommend a DOM parser, especially for a large XML file. A DOM parser loads the entire document into memory before you can read anything from it, which is extremely inefficient when your XML files are really large. SAX and StAX are streaming parsers: they read the document sequentially and never hold the whole tree in memory. SAX pushes events to your handler, while StAX lets your code pull the next event when it is ready. Have a read on the StAX parser in Java here:

StAX parser tutorial

I think a StAX parser is the most suitable parser for reading a large XML file.

FYI, here is a link to a SAX parser tutorial too:
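To make the StAX approach concrete, here is a minimal sketch of a streaming search by ID. The element names (`page`, `title`, `id`) are assumptions based on the usual layout of a Wikipedia dump, and the class and method names are made up for illustration, so adjust them to your file.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxIdSearch {

    // Returns the <title> of the first <page> whose direct <id> child equals
    // wantedId, or null if there is no such page. Element names are assumed.
    public static String findTitleById(Reader in, String wantedId) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            int depth = 0;       // 0 = outside <page>, 1 = directly inside <page>
            String title = null;
            String id = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (depth == 0) {
                        if (name.equals("page")) {
                            depth = 1;
                            title = null;
                            id = null;
                        }
                    } else if (depth == 1 && name.equals("title")) {
                        // getElementText() consumes the matching end tag too,
                        // so depth stays at 1.
                        title = reader.getElementText();
                    } else if (depth == 1 && name.equals("id")) {
                        id = reader.getElementText();
                    } else {
                        depth++; // nested element, e.g. an <id> inside <revision>
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT && depth > 0) {
                    depth--;
                    if (depth == 0 && wantedId.equals(id)) {
                        return title; // end of the matching <page>
                    }
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

For a real file you would pass something like a `BufferedReader` over the dump instead of an in-memory string. Note that this keeps memory usage flat but is still a linear scan, so every lookup reads the file until the match is found.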

SAX parser tutorial in Java
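For comparison, a rough SAX sketch for the id="" attribute case could look like the following. The class and method names are hypothetical, and throwing a private exception to stop parsing early is just one common idiom, not something SAX requires.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxIdSearch {

    // Thrown from the handler to abort parsing once a match is found.
    private static final class StopParsing extends SAXException {
        StopParsing() {
            super("match found");
        }
    }

    // Returns the name of the first element whose id attribute equals
    // wantedId, or null if no element matches.
    public static String findElementByIdAttribute(InputSource source, String wantedId)
            throws ParserConfigurationException, SAXException, IOException {
        final String[] found = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attrs) throws SAXException {
                // getValue returns null when the element has no id attribute.
                if (wantedId.equals(attrs.getValue("id"))) {
                    found[0] = qName;
                    throw new StopParsing();
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (StopParsing expected) {
            // Parsing was aborted on purpose after the first match.
        }
        return found[0];
    }
}
```

Here the parser drives the loop and calls back into the handler, which is the main practical difference from the pull-style StAX loop.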

Upvotes: 3
