Reputation: 11672
I am parsing a large file (>9GB) using lxml's iterparse in Python, clearing elements as I go forward. I was wondering: is there a way to parse backwards while clearing? I can see how I would implement this independently of lxml, but it would be nice to use this package.
Thank you in advance!
Upvotes: 1
Views: 2049
Reputation: 22675
iterparse() is strictly forward-only, I'm afraid. If you want to read a tree in reverse, you'll have to read it forward while writing it to some intermediate store (be it in memory or on disc) in some form that's easier for you to parse backwards, and then read that. I'm not aware of any stream parsers that allow XML to be parsed back-to-front.
Off the top of my head, you could use two files, one containing the data and the other an index of offsets to the records in the data file. That would make reading backwards relatively easy once it's been written.
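A minimal sketch of that two-file idea: a forward pass writes each record to a data file and its byte offset to an index file, then a second pass seeks to the offsets in reverse. The tag name "record" and the file names are placeholders, not anything from the question.

```python
import os
from lxml import etree

def build_index(xml_path, data_path, index_path, tag="record"):
    """Forward pass: serialize each record and note its byte offset."""
    offsets = []
    with open(data_path, "wb") as data:
        for _, elem in etree.iterparse(xml_path, tag=tag):
            offsets.append(data.tell())
            data.write(etree.tostring(elem))
            elem.clear()  # free memory as we go forward
    with open(index_path, "w") as idx:
        for off in offsets:
            idx.write(f"{off}\n")

def read_backwards(data_path, index_path):
    """Second pass: seek to each recorded offset in reverse order."""
    with open(index_path) as idx:
        offsets = [int(line) for line in idx]
    size = os.path.getsize(data_path)
    with open(data_path, "rb") as data:
        for i in range(len(offsets) - 1, -1, -1):
            end = offsets[i + 1] if i + 1 < len(offsets) else size
            data.seek(offsets[i])
            yield etree.fromstring(data.read(end - offsets[i]))
```

Only the index has to fit in memory; the records themselves stay on disc until you seek to them.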
Upvotes: 0
Reputation: 1508
Yes and no...
There is no 'easy' solution for starting 'from the end' (in reverse). But there is an iterator that goes forward until the end and 'clears the references' on its way, which optimizes the read.
Approach 1: split the file along its structure and nodes so you parse only what you want.
Approach 2: check the 'smart' way to parse it at [1]
What I did in my case: I knew beforehand that my data, in a 12GB file, was in the last 2GB. So I used the unix split command to split the file and processed only the last piece.
(This is an ugly hack, but in MY case it was simple and worked fast enough. You could use tail too, but I wanted to archive the other pieces.)
--> A real python master would use file.seek(), but I thought the unix commands were faster.
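For the record, the file.seek() version of that hack might look like the sketch below: jump near the end, resync on the next record boundary, and re-wrap what remains so lxml can parse it. The tag names ('item', 'log') and the resync-by-substring heuristic are illustrative assumptions, not part of the original answer.

```python
from lxml import etree

def parse_tail(path, approx_offset):
    """Seek near the end of a huge XML file, skip the partial record we
    probably landed inside, and parse the rest re-wrapped in a root tag.
    Tag names 'item'/'log' are placeholders for your actual schema."""
    with open(path, "rb") as f:
        f.seek(approx_offset)            # e.g. file size minus 2 GB
        chunk = f.read()
    start = chunk.find(b"<item")         # resync on the next record start
    end = chunk.rfind(b"</item>")        # last complete record end
    if start == -1 or end == -1:
        return []
    body = chunk[start:end + len(b"</item>")]
    root = etree.fromstring(b"<log>" + body + b"</log>")
    return root.findall("item")
```

This avoids copying the file at all, at the cost of a crude string search to find a clean record boundary.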
Now I use the second approach [1].
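The 'smart' way the article at [1] recommends boils down to iterparse plus aggressive clearing, along the lines of its fast_iter pattern (the tag name "record" here is a placeholder):

```python
from lxml import etree

def fast_iter(context, func):
    """iterparse + clear, per the pattern in [1]: process each element,
    then free it and the already-seen siblings the root keeps alive."""
    for event, elem in context:
        func(elem)
        elem.clear()
        # delete preceding siblings, which lxml keeps attached to the root
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

# usage sketch (file name and tag are placeholders):
# context = etree.iterparse("big.xml", events=("end",), tag="record")
# fast_iter(context, lambda e: print(e.get("id")))
```

Memory stays roughly constant no matter how large the file is, which is what makes a single forward pass over 9GB practical.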
[1] - http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
I hope this helps you; I had a hard time understanding the xml structure myself.
Upvotes: 1