Reputation: 529
I have to process XML files that contain potentially large (up to 2GB) content. In these files, the 'large' part of the content is not spread over the whole file but is contained in a single element (an encrypted file, hex-encoded).
I have no control over the source of the files, so I need to deal with that situation.
A requirement is to keep a small memory footprint (< 500 MB). I was able to read and process the file's contents in streaming mode using xml.sax, which does its job just fine.
The problem is that these files also need to be validated against an XML schema definition (.xsd file), which xml.sax doesn't seem to support.
I found some up-to-date libraries for schema validation, like xmlschema, but none that perform the validation in a streaming/lazy fashion.
Can anyone recommend a way to do this?
Upvotes: 2
Views: 841
Reputation: 529
Michael Kay's answer had this nice idea of a content filter that can condense long text. This helped me solve my problem.
I ended up writing a simple text shrinker that pre-processes an XML file for me by truncating the text content of named elements (like: "only keep the first 64 bytes of the text in the 'Data' and 'CipherValue' elements, don't touch anything else").
The resulting file is then small enough to feed into a validator like xmlschema.
If anyone needs something similar: here is a sketch of the shrinker.
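A minimal version built on xml.sax's XMLFilterBase. The element names and the 64-byte limit are just the examples from above; the command-line handling is a placeholder:

```python
import sys
from xml import sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class TextShrinker(XMLFilterBase):
    """SAX filter that truncates the text content of selected elements."""

    def __init__(self, parent, shrink_tags, keep_chars=64):
        super().__init__(parent)
        self._shrink_tags = set(shrink_tags)
        self._keep = keep_chars   # content is hex-encoded, so chars == bytes
        self._forwarded = 0       # chars already passed through for the current element
        self._shrinking = False

    def startElement(self, name, attrs):
        if name in self._shrink_tags:
            self._shrinking = True
            self._forwarded = 0
        super().startElement(name, attrs)

    def endElement(self, name):
        if name in self._shrink_tags:
            self._shrinking = False
        super().endElement(name)

    def characters(self, content):
        # SAX hands large text nodes to the handler in many small chunks,
        # so we track how much of the current element's text went through.
        if self._shrinking:
            budget = self._keep - self._forwarded
            if budget <= 0:
                return            # drop the rest of the text node
            content = content[:budget]
            self._forwarded += len(content)
        super().characters(content)

if __name__ == "__main__":
    # usage: python shrink.py big.xml > shrunk.xml
    shrinker = TextShrinker(sax.make_parser(), {"Data", "CipherValue"})
    shrinker.setContentHandler(XMLGenerator(sys.stdout, encoding="utf-8"))
    shrinker.parse(sys.argv[1])
```

After shrinking, validation is an ordinary in-memory call, e.g. xmlschema.validate('shrunk.xml', 'my_schema.xsd').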
If you use this, be careful:
This changes the content of the XML and could cause problems if the XML schema definition contains things like minimum or maximum length checks for the affected elements.
Upvotes: 1
Reputation: 163458
Many schema processors (such as Xerces and Saxon) operate in streaming mode, so there's no need to hold the data in memory while it's being validated. However, a single 2GB text node is stretching Java's limits on the size of strings and arrays, and even a streaming processor is quite likely to want to hold the whole of a single node in memory.
If there are no validation constraints on the content of this text node (e.g. you don't need to validate that it is valid xs:base64Binary), then I would suggest using a schema validator (such as Saxon) that accepts SAX input, and supplying the input via a SAX filter that eliminates or condenses the long text value.

A SAX parser supplies text to the ContentHandler in multiple chunks, so there should be no limit in the SAX parser on the size of a text node. Saxon will try to combine the multiple chunks into a single string (or char array) and may fail at this stage, either because of Java limits or because of the amount of memory available; but if your filter cuts out the big text node, this won't happen.
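For the Python setting of the question, a rough equivalent of this filter-then-validate pipeline, without writing an intermediate file, could look like the sketch below. It reuses the TextShrinker from the answer above (via a hypothetical shrink module) and serializes the condensed event stream into an in-memory buffer, since the common Python validators don't consume SAX events directly:

```python
import io
from xml import sax
from xml.sax.saxutils import XMLGenerator
import xmlschema

from shrink import TextShrinker  # the filter from the answer above (hypothetical module name)

# Condense the big text nodes while parsing; the buffer only ever holds
# the shrunken document, so memory stays bounded.
buf = io.StringIO()
shrinker = TextShrinker(sax.make_parser(), {"Data", "CipherValue"})
shrinker.setContentHandler(XMLGenerator(buf, encoding="utf-8"))
shrinker.parse("big.xml")

# Validate the condensed document straight from the buffer.
buf.seek(0)
xmlschema.XMLSchema("my_schema.xsd").validate(buf)
```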
Upvotes: 2