Reputation: 529
I have to process XML files that contain potentially large (up to 2GB) content. In these files, the 'large' part of the content is not spread over the whole file but is contained in a single element (an encrypted file, hex-encoded).
I have no control over the source of the files, so I need to deal with that situation.
A requirement is to keep a small memory footprint (< 500 MB). I was able to read and process the file's contents in streaming mode using xml.sax, which does its job just fine.
The problem is that these files also need to be validated against an XML schema definition (.xsd file), which xml.sax doesn't seem to support.
I found some up-to-date libraries for schema validation, like xmlschema, but none that perform the validation in a streaming/lazy fashion.
Can anyone recommend a way to do this?
Upvotes: 2
Views: 841
Reputation: 529
Michael Kay's answer had this nice idea of a content filter that can condense long text. This helped me solve my problem.
I ended up writing a simple text shrinker that pre-processes an XML file for me by truncating the text content of named elements (like: "only keep the first 64 bytes of the text in the 'Data' and 'CipherValue' elements, don't touch anything else").
The resulting file is then small enough to feed into a validator like xmlschema.
If anyone needs something similar: here is a sketch of the shrinker.
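A minimal version built on xml.sax's XMLFilterBase. The element names and the 64-byte limit are just the examples from above; the command-line handling is a placeholder:

```python
import sys
from xml import sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

class TextShrinker(XMLFilterBase):
    """SAX filter that truncates the text content of selected elements."""

    def __init__(self, parent, shrink_tags, keep_chars=64):
        super().__init__(parent)
        self._shrink_tags = set(shrink_tags)
        self._keep = keep_chars   # content is hex-encoded, so chars == bytes
        self._forwarded = 0       # chars already passed through for the current element
        self._shrinking = False

    def startElement(self, name, attrs):
        if name in self._shrink_tags:
            self._shrinking = True
            self._forwarded = 0
        super().startElement(name, attrs)

    def endElement(self, name):
        if name in self._shrink_tags:
            self._shrinking = False
        super().endElement(name)

    def characters(self, content):
        # SAX hands large text nodes to the handler in many small chunks,
        # so we track how much of the current element's text went through.
        if self._shrinking:
            budget = self._keep - self._forwarded
            if budget <= 0:
                return            # drop the rest of the text node
            content = content[:budget]
            self._forwarded += len(content)
        super().characters(content)

if __name__ == "__main__":
    # usage: python shrink.py big.xml > shrunk.xml
    shrinker = TextShrinker(sax.make_parser(), {"Data", "CipherValue"})
    shrinker.setContentHandler(XMLGenerator(sys.stdout, encoding="utf-8"))
    shrinker.parse(sys.argv[1])
```

After shrinking, validation is an ordinary in-memory call, e.g. xmlschema.validate('shrunk.xml', 'my_schema.xsd').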
If you use this, be careful:
This changes the content of the XML and could cause problems if the XML schema definition contains things like minimum or maximum length checks for the affected elements.
Upvotes: 1
Reputation: 163458
Many schema processors (such as Xerces and Saxon) operate in streaming mode, so there's no need to hold the data in memory while it's being validated. However, a single 2GB text node is stretching Java's limits on the size of strings and arrays, and even a streaming processor is quite likely to want to hold the whole of a single node in memory.
If there are no validation constraints on the content of this text node (e.g. you don't need to validate that it is valid xs:base64Binary), then I would suggest using a schema validator (such as Saxon) that accepts SAX input, and supplying the input via a SAX filter that eliminates or condenses the long text value.

A SAX parser supplies text to the ContentHandler in multiple chunks, so there should be no limit in the SAX parser on the size of a text node. Saxon will try to combine the multiple chunks into a single string (or char array) and may fail at this stage, either because of Java limits or because of the amount of memory available; but if your filter cuts out the big text node, this won't happen.
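For the Python setting of the question, a rough equivalent of this filter-then-validate pipeline, without writing an intermediate file, could look like the sketch below. It reuses the TextShrinker from the answer above (via a hypothetical shrink module) and serializes the condensed event stream into an in-memory buffer, since the common Python validators don't consume SAX events directly:

```python
import io
from xml import sax
from xml.sax.saxutils import XMLGenerator
import xmlschema

from shrink import TextShrinker  # the filter from the answer above (hypothetical module name)

# Condense the big text nodes while parsing; the buffer only ever holds
# the shrunken document, so memory stays bounded.
buf = io.StringIO()
shrinker = TextShrinker(sax.make_parser(), {"Data", "CipherValue"})
shrinker.setContentHandler(XMLGenerator(buf, encoding="utf-8"))
shrinker.parse("big.xml")

# Validate the condensed document straight from the buffer.
buf.seek(0)
xmlschema.XMLSchema("my_schema.xsd").validate(buf)
```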
Upvotes: 2