Reputation: 11672
I am parsing a large file (>9GB) using lxml's iterparse in Python, clearing elements as I go forward. I was wondering: is there a way to parse backwards while clearing? I can see how I would implement this independently of lxml, but it would be nice to use this package.
Thank you in advance!
Upvotes: 1
Views: 2049
Reputation: 22675
iterparse() is strictly forward-only, I'm afraid. If you want to read a tree in reverse, you'll have to read it forward while writing it to some intermediate store (be it in memory or on disc) in some form that's easier for you to parse backwards, and then read that. I'm not aware of any stream parsers that allow XML to be parsed back-to-front.
Off the top of my head, you could use two files, one containing the data and the other an index of offsets to the records in the data file. That would make reading backwards relatively easy once it's been written.
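A minimal sketch of that two-file idea: a forward pass writes each record to a data file and its byte offset to an index file, then a second pass seeks to the offsets in reverse. The tag name "record" and the file names are placeholders, not anything from the question.

```python
import os
from lxml import etree

def build_index(xml_path, data_path, index_path, tag="record"):
    """Forward pass: serialize each record and note its byte offset."""
    offsets = []
    with open(data_path, "wb") as data:
        for _, elem in etree.iterparse(xml_path, tag=tag):
            offsets.append(data.tell())
            data.write(etree.tostring(elem))
            elem.clear()  # free memory as we go forward
    with open(index_path, "w") as idx:
        for off in offsets:
            idx.write(f"{off}\n")

def read_backwards(data_path, index_path):
    """Second pass: seek to each recorded offset in reverse order."""
    with open(index_path) as idx:
        offsets = [int(line) for line in idx]
    size = os.path.getsize(data_path)
    with open(data_path, "rb") as data:
        for i in range(len(offsets) - 1, -1, -1):
            end = offsets[i + 1] if i + 1 < len(offsets) else size
            data.seek(offsets[i])
            yield etree.fromstring(data.read(end - offsets[i]))
```

Only the index has to fit in memory; the records themselves stay on disc until you seek to them.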
Upvotes: 0
Reputation: 1508
Yes and no...
There is no 'easy' solution for starting 'from the end' (in reverse). But there is an iterator that goes forward until the end and 'clears the references' on its way, which optimizes the read.
Approach 1: split the file along its structure and nodes so you parse only what you want.
Approach 2: check the 'smart' way to parse it at [1]
What I did in my case: I knew beforehand that my data, in a 12GB file, was in the last 2GB. So I used the unix split command to split the file and processed only the last piece.
(This is an ugly hack, but in MY case it was simple and worked fast enough. You could use tail too, but I wanted to archive the other pieces.)
--> A real python master would use file.seek(), but I thought the unix commands were faster.
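For the record, the file.seek() version of that hack might look like the sketch below: jump near the end, resync on the next record boundary, and re-wrap what remains so lxml can parse it. The tag names ('item', 'log') and the resync-by-substring heuristic are illustrative assumptions, not part of the original answer.

```python
from lxml import etree

def parse_tail(path, approx_offset):
    """Seek near the end of a huge XML file, skip the partial record we
    probably landed inside, and parse the rest re-wrapped in a root tag.
    Tag names 'item'/'log' are placeholders for your actual schema."""
    with open(path, "rb") as f:
        f.seek(approx_offset)            # e.g. file size minus 2 GB
        chunk = f.read()
    start = chunk.find(b"<item")         # resync on the next record start
    end = chunk.rfind(b"</item>")        # last complete record end
    if start == -1 or end == -1:
        return []
    body = chunk[start:end + len(b"</item>")]
    root = etree.fromstring(b"<log>" + body + b"</log>")
    return root.findall("item")
```

This avoids copying the file at all, at the cost of a crude string search to find a clean record boundary.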
Now I use the second approach [1].
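The 'smart' way the article at [1] recommends boils down to iterparse plus aggressive clearing, along the lines of its fast_iter pattern (the tag name "record" here is a placeholder):

```python
from lxml import etree

def fast_iter(context, func):
    """iterparse + clear, per the pattern in [1]: process each element,
    then free it and the already-seen siblings the root keeps alive."""
    for event, elem in context:
        func(elem)
        elem.clear()
        # delete preceding siblings, which lxml keeps attached to the root
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

# usage sketch (file name and tag are placeholders):
# context = etree.iterparse("big.xml", events=("end",), tag="record")
# fast_iter(context, lambda e: print(e.get("id")))
```

Memory stays roughly constant no matter how large the file is, which is what makes a single forward pass over 9GB practical.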
[1] - http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
I hope this helps you; I had a hard time understanding the xml structure myself.
Upvotes: 1