Michael Kročka

Reputation: 655

How to iteratively update an xml file that won't fit into memory?

I have a 10GB XML file extracted from the en-wikipedia-articles-pages-latest.xml dump; it contains every XML element whose text mentions the word "football". My goal is to produce a new output XML file holding only player names and their corresponding teams throughout the years. For example, when I come across the Lionel Messi page, I parse the infobox, which contains the information I need, and write it to the output XML file.

The problem: I may first come across an unknown footballer, or a footballer page with an old or broken infobox, and only later hit a football team's page that has correct information about that same player. By then the entry in the output XML has already been written, and it should be overwritten with the new information.

I can't keep the output XML as an in-memory object, because it is too large, and I don't want to sequentially scan the output file looking for a particular entry every time. Is there a general approach for handling this kind of situation?

Upvotes: 0

Views: 37

Answers (1)

Michael Kay

Reputation: 163360

One approach is to load the whole thing into an XML database such as eXist-db or BaseX.
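The win with a database is keyed, in-place updates: when a later page yields better data for a player, you replace that one record instead of rescanning a flat output file. As a rough illustration of that access pattern only (Python's `sqlite3` standing in here for a real XML database; the schema and names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # on disk this would be a file, not RAM
conn.execute("CREATE TABLE players (name TEXT PRIMARY KEY, teams TEXT)")

def upsert(name, teams):
    # Overwrite any earlier (e.g. broken-infobox) record for this player.
    conn.execute(
        "INSERT INTO players VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET teams = excluded.teams",
        (name, teams),
    )

upsert("John Doe", "Unknown FC")           # first pass: broken infobox
upsert("John Doe", "Barcelona 2004-2021")  # a later page has the real data
row = conn.execute(
    "SELECT teams FROM players WHERE name = ?", ("John Doe",)
).fetchone()
```

In an XML database the same keyed replacement would be an XQuery Update rather than SQL, but the lookup-and-overwrite shape is identical.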

Another approach is to organise the work as a pipeline of streaming transformations (e.g. using XSLT 3.0). That's rather more work, but will ultimately be faster.
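A streaming pipeline chains transformations so that no stage ever materialises the whole document. The same shape in plain Python generators (a sketch only; the stage functions and the line-based input format are invented for illustration):

```python
def parse(lines):
    # Stage 1: pretend each input line is one parsed page record.
    for line in lines:
        title, _, body = line.partition("|")
        yield title, body

def only_football(pages):
    # Stage 2: keep only pages that mention football.
    for title, body in pages:
        if "football" in body:
            yield title, body

def to_records(pages):
    # Stage 3: emit the final (player, info) records.
    for title, body in pages:
        yield {"player": title, "info": body.strip()}

raw = ["Lionel Messi|football forward", "Physics|a science"]
records = list(to_records(only_football(parse(raw))))
```

Each stage pulls one item at a time from the previous one, so memory use is bounded by a single record regardless of input size.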

Upvotes: 2
