Reputation: 442
I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had read about potential memory-consumption issues, I already use fast_iter, but after the main loop the process uses about 8 GB of RAM, even though it doesn't keep any data from the actual XML file.
    from lxml import etree

    def fast_iter(context, func, *args, **kwargs):
        # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        # Author: Liza Daly
        for event, elem in context:
            func(elem, *args, **kwargs)
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    def process_element(elem):
        pass

    context = etree.iterparse("sachsen-latest.osm", tag="node", events=("end", ))
    fast_iter(context, process_element)
I don't understand why there is such a massive leak, because each element and the whole context are deleted in fast_iter(), and at the moment I don't even process the XML data.
Any ideas?
Upvotes: 0
Views: 1182
Reputation: 51
The problem is with the behavior of etree.iterparse(). You would think it only uses memory for each node element, but it turns out it still keeps all the other elements in memory. Since you don't clear them, memory ends up blowing up later on, especially when parsing .osm (OpenStreetMap) files and looking for nodes, but more on that later.
The solution I found was not to catch only node tags but to catch all tags:

    context = etree.iterparse(open(filename, 'r'), events=('end',))

And then clear all the tags, but only parse the ones you are interested in:
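
    # progress.bar only wraps the iterator to show progress;
    # iterate over context directly if you don't need it.
    for (event, elem) in progress.bar(context):
        if elem.tag == 'node':
            pass  # do things here
        # Clear every element, not just the nodes.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
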
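Do keep in mind that this may clear other elements you are interested in before you get to use them: a child's end event fires before its parent's, so the child is already emptied by the time the parent arrives. Make sure to add more ifs where needed. For example (and this is .osm specific), tags nested inside nodes: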

    if elem.tag == 'tag':
        # Skip clearing so the tag is still intact when its parent node ends.
        continue
    if elem.tag == 'node':
        for tag in elem.iterchildren():
            pass  # do stuff

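Putting the two snippets together, a minimal self-contained sketch might look like the one below. The sample document, the `collect_nodes` name, and the returned list of `(id, tags)` pairs are assumptions made for illustration; they are not part of the original code.

```python
import io
from lxml import etree

# Hypothetical miniature .osm document, for illustration only.
SAMPLE = b"""<osm>
  <node id="1" lat="51.0" lon="13.7"><tag k="name" v="A"/></node>
  <node id="2" lat="51.1" lon="13.8"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="path"/></way>
</osm>"""

def collect_nodes(source):
    """Return [(node_id, {key: value}), ...] while freeing processed elements."""
    results = []
    context = etree.iterparse(source, events=("end",))
    for event, elem in context:
        if elem.tag == "tag":
            # Leave <tag> alone so it is still intact when its parent ends.
            continue
        if elem.tag == "node":
            tags = dict((t.get("k"), t.get("v")) for t in elem.iterchildren("tag"))
            results.append((elem.get("id"), tags))
        # Free this element and every sibling that precedes it.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context
    return results

nodes = collect_nodes(io.BytesIO(SAMPLE))
```

The skipped `tag` elements are not leaked: they are removed when their parent is cleared on its own end event.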
The reason why memory was blowing up later is pretty interesting: .osm files are organized so that nodes come first, then ways, then relations. So your code does fine with the nodes at the beginning, then memory fills up as etree goes through the rest of the elements.
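This can be checked on a small document. The sketch below is only an illustration, not a memory measurement: the sample data and the `leftover_tags` helper are made up, and it simply lists what is still attached to the root element (read from the `root` property of the iterparse object) once the clearing loop has finished.

```python
import io
from lxml import etree

# Hypothetical miniature .osm document: nodes first, then ways, then relations.
SAMPLE = b"""<osm>
  <node id="1"/>
  <node id="2"/>
  <way id="10"><nd ref="1"/><nd ref="2"/></way>
  <way id="11"><nd ref="2"/></way>
  <relation id="100"><member ref="10"/></relation>
</osm>"""

def leftover_tags(**iterparse_kwargs):
    """Run the clearing loop, then list the tags still attached under the root."""
    context = etree.iterparse(io.BytesIO(SAMPLE), events=("end",), **iterparse_kwargs)
    for event, elem in context:
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return [child.tag for child in context.root]

only_nodes = leftover_tags(tag="node")  # events for <node> only, as in the question
all_tags = leftover_tags()              # events for every element, as in this answer
```

With `tag="node"`, the ways and relations never get an event, so nothing ever removes them from the tree; with all end events handled, the root ends up empty.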
Upvotes: 3