Cenoc

Reputation: 11672

lxml, parsing in reverse

I am parsing a large file (>9 GB) with lxml's iterparse in Python, clearing elements as I go forward. Is there a way to parse backwards while clearing? I can see how I would implement this independently of lxml, but it would be nice to use this package.

Thank you in advance!

Upvotes: 1

Views: 2049

Answers (2)

Keith Gaughan

Reputation: 22675

iterparse() is strictly forward-only, I'm afraid. If you want to read a tree in reverse, you'll have to read it forward, while writing it to some intermediate store (be it in memory or on disc) in some form that's easier for you to parse backwards, and then read that. I'm not aware of any stream parsers that allow XML to be parsed back-to-front.

Off the top of my head, you could use two files, one containing the data and the other an index of offsets to the records in the data file. That would make reading backwards relatively easy once it's been written.
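A minimal sketch of that two-file idea, under some assumptions: the records are non-nested elements with a known tag, and each index entry is a fixed-size (offset, length) pair so the index itself can be walked backwards. The names build_index and iter_reverse and the index layout are made up for illustration. It uses the standard library's xml.etree.ElementTree to stay self-contained; lxml.etree.iterparse should slot into the forward pass in much the same way.

```python
import struct
import xml.etree.ElementTree as ET

# One index entry per record: (byte offset in data file, byte length).
ENTRY = struct.Struct("<QQ")


def build_index(xml_path, tag, data_path, index_path):
    """Forward pass: copy every <tag> element into data_path and
    append its position to index_path. Returns the record count."""
    count = 0
    with open(data_path, "wb") as data, open(index_path, "wb") as index:
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag != tag:
                continue
            elem.tail = None  # do not serialize text that follows the record
            blob = ET.tostring(elem, encoding="utf-8")
            index.write(ENTRY.pack(data.tell(), len(blob)))
            data.write(blob)
            # Frees the record's children and text; the emptied element
            # itself stays attached to its parent in the stdlib parser.
            elem.clear()
            count += 1
    return count


def iter_reverse(data_path, index_path):
    """Second pass: walk the index from its last entry to its first,
    yielding each record as a freshly parsed element."""
    with open(data_path, "rb") as data, open(index_path, "rb") as index:
        index.seek(0, 2)  # jump to the end of the index file
        pos = index.tell()
        while pos >= ENTRY.size:
            pos -= ENTRY.size
            index.seek(pos)
            start, length = ENTRY.unpack(index.read(ENTRY.size))
            data.seek(start)
            yield ET.fromstring(data.read(length))
```

The forward pass costs one full read of the input plus roughly the same amount of disc space again, but after that the records can be visited in any order, not just in reverse.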

Upvotes: 0

Carlos Henrique Cano

Reputation: 1508

Yes and no...

There is no 'easy' solution for starting from the end and parsing in reverse. But you can use an iterator that runs through to the end and, on its way, clears the references it no longer needs, which keeps the read efficient.

Approach 1: split the file along its structure and nodes so that you parse only what you want.

Approach 2: use the 'smart' way of parsing described in [1].

Here is what I did in my case. I knew beforehand that my data sat in the last 2 GB of a 12 GB file, so I used the Unix split command to split the file and processed only the last piece.

(This is an ugly hack, but in MY case it was simple and worked fast enough. You can use tail too, but I wanted to archive the other pieces as well.)

--> A real Python master would use file.seek(), but I thought the Unix commands would be faster.
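For what it's worth, a rough sketch of the file.seek() variant, not what I actually ran. It assumes flat, non-nested records, a root element with no namespace declarations that the records depend on, UTF-8 content, and a record start tag that shows up within the first chunk of the tail. The function name iter_tail_records and its parameters are made up. It seeks near the end, skips ahead to the first complete record, and feeds the rest to the standard library's XMLPullParser behind a synthetic root opening tag.

```python
import os
import xml.etree.ElementTree as ET

# Bytes that may legally follow a tag name in a start tag.
_NAME_END = (b" ", b">", b"/", b"\t", b"\n", b"\r")


def iter_tail_records(path, tail_bytes, root_tag, record_tag, chunk_size=1 << 20):
    """Yield <record_tag> elements found in the last tail_bytes of the file."""
    marker = b"<" + record_tag.encode()
    parser = ET.XMLPullParser(events=("end",))
    # Synthetic opening tag, so the file's real closing root tag has a match.
    parser.feed(b"<" + root_tag.encode() + b">")
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        f.seek(max(0, size - tail_bytes))
        chunk = f.read(chunk_size)
        # Skip the partial record at the front of the tail. The check on the
        # following byte avoids matching e.g. <items> when looking for <item.
        start = chunk.find(marker)
        while start != -1:
            end = start + len(marker)
            if chunk[end:end + 1] in _NAME_END:
                break
            start = chunk.find(marker, start + 1)
        if start == -1:
            raise ValueError("no record start found in the first chunk of the tail")
        chunk = chunk[start:]
        while chunk:
            parser.feed(chunk)
            for _event, elem in parser.read_events():
                if elem.tag == record_tag:
                    yield elem
                    elem.clear()  # release the record once the caller is done
            chunk = f.read(chunk_size)
    parser.close()
```

A plain byte search like this can be fooled by the marker appearing inside a comment, CDATA section, or attribute value, so it is only reasonable when you know the file's layout well.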

Now I use the second approach [1]

[1] - http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
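A sketch of the forward-and-clear pattern as it is commonly reproduced from that article, assuming lxml is installed and the records of interest share one tag name. The helper name fast_iter and the handle callback are illustrative. After each element is handled it is cleared, and its already-processed preceding siblings are deleted from the parent so the tree does not keep growing.

```python
from lxml import etree


def fast_iter(source, tag, handle):
    """Call handle(elem) for every <tag> element in source, freeing
    each element and its earlier siblings as soon as it is processed."""
    context = etree.iterparse(source, events=("end",), tag=tag)
    for _event, elem in context:
        handle(elem)
        # Drop the element's own children, text and attributes.
        elem.clear()
        # Remove earlier siblings that are still attached to the parent;
        # without this the root keeps one empty element per record.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
```

This still reads the file front to back, so it does not give you reverse order, but it does keep memory roughly flat over a multi-gigabyte file.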

I hope this helps. I had a hard time understanding the XML structure.

Upvotes: 1
