Fiach Reid

Reputation: 7059

Indexing a large XML file

Given a large (74GB) XML file, I need to read specific XML nodes by a given alphanumeric ID. Scanning the file from top to bottom to find each ID takes too long.

Is there an analogue of an index for XML files, like there is for relational databases? I imagine a small index file in which the alphanumeric ID is quick to find and which points to the location in the larger file.

Do index files for XML exist? How can they be implemented in C#?

Upvotes: 1

Views: 674

Answers (2)

Michael Kay

Reputation: 163322

XML databases such as BaseX, eXist-db, or MarkLogic do what you are looking for: they load XML documents into a persistent form on disk and allow fast access to parts of the document by use of indexes.

Some XML databases are optimized for handling many small documents, others are able to handle a small number of large documents, so choose your product carefully (I can't advise you on this), and consider breaking the document up into smaller parts as it is loaded.

If you need to split the large document into lots of small documents, consider a streaming XSLT 3.0 processor such as Saxon-EE. I would expect processing 75 GB to take about an hour, depending, obviously, on the speed of your machine.

Upvotes: 2

No, that is beyond the scope of what XML tries to achieve. If the XML does not change often and you read from it a lot, I would propose rewriting its content into a local SQLite database once per change and then reading from the database instead. When doing the rewriting, remember that SAX-style XML reading is your friend in the case of huge files like this.
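As a rough illustration of that idea, here is a minimal sketch in Python rather than C#, using the standard library's incremental `iterparse` in place of a true SAX handler. The element name `record`, the attribute name `id`, and the table layout are all assumptions made for the example, and it assumes the records are direct children of the root element.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def build_index(xml_source, db, tag="record", id_attr="id"):
    """Stream xml_source and copy every <tag> element into SQLite, keyed by its ID.

    `tag` and `id_attr` are hypothetical names; adjust them to the real schema.
    """
    db.execute("CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY, xml TEXT)")
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != tag:
            continue
        node_id = elem.get(id_attr)
        if node_id is not None:
            elem.tail = None  # do not store whitespace that follows the element
            db.execute(
                "INSERT OR REPLACE INTO nodes (id, xml) VALUES (?, ?)",
                (node_id, ET.tostring(elem, encoding="unicode")),
            )
        root.clear()  # drop already-processed children so memory use stays flat
    db.commit()


def lookup(db, node_id):
    """Return the stored XML text for node_id, or None if it is not indexed."""
    row = db.execute("SELECT xml FROM nodes WHERE id = ?", (node_id,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<root><record id="a1"><name>Alpha</name></record>'
        b'<record id="b2"><name>Beta</name></record></root>'
    )
    conn = sqlite3.connect(":memory:")  # use a file path for a persistent index
    build_index(sample, conn)
    print(lookup(conn, "b2"))
```

The one-off import still has to read the whole file, but every later lookup is a primary-key query instead of a full scan. For a real import you would probably commit in batches and pass a file path instead of an in-memory buffer.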

Theoretically, you can create a sort of index by remembering the locations of already-discovered IDs and then parsing on your own, but that would be very brittle. XML is not simple enough for you to parse it yourself and hope to be standards-compliant.
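To make the brittleness concrete, here is a sketch of such a hand-rolled offset index, again in Python. It is not a real XML parser: it assumes a machine-generated, line-oriented file in which every record starts with the literal text `<record id="...">`, records are never nested, and the pattern never appears inside comments or CDATA. The element and attribute names are hypothetical.

```python
import re

# Assumed exact formatting of an opening tag; any deviation breaks the index.
RECORD_START = re.compile(rb'<record id="([^"]+)"')
RECORD_END = b"</record>"


def build_offset_index(path):
    """Scan the file once and map each record ID to the byte offset of its opening tag."""
    index = {}
    offset = 0  # byte offset of the start of the current line
    with open(path, "rb") as f:
        for line in f:
            for match in RECORD_START.finditer(line):
                index[match.group(1).decode("utf-8")] = offset + match.start()
            offset += len(line)
    return index


def read_record(path, offset, chunk_size=64 * 1024):
    """Seek straight to offset and return the text up to the next closing tag."""
    buf = b""
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                raise ValueError("closing tag not found after offset")
            buf += chunk
            end = buf.find(RECORD_END)
            if end != -1:
                return buf[: end + len(RECORD_END)].decode("utf-8")
```

The dictionary returned by `build_offset_index` could be saved to a small side file and reloaded later, which is essentially the index file the question describes. It becomes stale as soon as the XML changes, and an extracted fragment that relies on namespaces declared on the root element will not parse on its own.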

Of course, I am assuming here that you can't change the larger design itself: as others noted, the size of that file suggests that there is an architectural problem.

Upvotes: 0
