Thomas Skowron

Reputation: 442

lxml and fast_iter eating all the memory

I want to parse a 1.6 GB XML file with Python (2.7.2) using lxml (3.2.0) on OS X (10.8.2). Because I had read about potential memory-consumption issues, I already use fast_iter, but after the main loop it eats up about 8 GB of RAM, even though it doesn't keep any data from the actual XML file.

from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    pass

context = etree.iterparse("sachsen-latest.osm", tag="node", events=("end", ))
fast_iter(context, process_element)

I don't understand why there is such a massive leak: the element and the whole context are deleted in fast_iter(), and at the moment I don't even process the XML data.

Any ideas?

Upvotes: 0

Views: 1182

Answers (1)

Miquel Llobet

Reputation: 51

The problem is the behavior of etree.iterparse(). You would think it only uses memory for each node element, but it turns out it still keeps all the other elements in memory. Since you never clear them, memory blows up later on, especially when parsing .osm (OpenStreetMap) files and looking for nodes, but more on that later.
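To make this concrete, here is a minimal sketch that runs the question's cleanup loop with tag="node" over a tiny, made-up .osm document and then counts what is still attached to the tree. The SAMPLE data and the function name leftover_after_filtered_parse are hypothetical and exist only for this illustration. It assumes lxml's iterparse exposes the finished tree as context.root.

```python
import io
from collections import Counter

from lxml import etree

# Hypothetical miniature .osm document, made up for this illustration.
SAMPLE = b"""<osm>
  <node id="1" lat="51.0" lon="13.7"/>
  <node id="2" lat="51.1" lon="13.8"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
  <relation id="20"><member type="way" ref="10" role=""/></relation>
</osm>"""


def leftover_after_filtered_parse(source):
    """Run the question's loop with tag="node" and count what stays in the tree."""
    context = etree.iterparse(source, tag="node", events=("end",))
    for event, elem in context:
        # Only <node> elements ever reach this loop, so only they get cleaned up.
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    # context.root is the root element once parsing has finished.
    return Counter(el.tag for el in context.root.iter("*"))


leftover = leftover_after_filtered_parse(io.BytesIO(SAMPLE))
print(sorted(leftover.items()))
```

In this sketch the way, nd, tag, relation and member elements are all still in the tree after the loop, because the tag filter only limits which events are reported, not which elements are built.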

The solution I found was to catch all tags instead of only node tags:

# binary mode lets the parser handle the encoding itself
context = etree.iterparse(open(filename, 'rb'), events=('end',))

And then clear all the tags, but only parse the ones you are interested in:

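# progress.bar is an optional progress wrapper (not shown here);
# iterating over context directly works the same way
for (event, elem) in progress.bar(context):
    if elem.tag == 'node':
        pass  # do things here

    # clear every element, not just nodes
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
del context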

Keep in mind that this may delete other elements that you are interested in, so add more ifs where needed. For example (and this is .osm specific), to keep the tag elements nested inside nodes until their parent node is processed:

if elem.tag == 'tag':
    continue  # skip clearing so the tag survives until its parent's end event
if elem.tag == 'node':
    for tag in elem.iterchildren():
        pass  # do stuff
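Putting the pieces together, a self-contained version could look roughly like the sketch below. The generator name iter_osm_nodes and the sample document are hypothetical, chosen only for illustration. It also adds a guard for elements without a parent, which the snippets above don't have.

```python
import io

from lxml import etree


def iter_osm_nodes(source):
    """Yield (node_id, tags_dict) for every <node>, freeing elements as it goes."""
    context = etree.iterparse(source, events=("end",))
    for event, elem in context:
        if elem.tag == "tag":
            # Leave <tag> alone so it is still attached when its parent ends.
            continue
        if elem.tag == "node":
            tags = {t.get("k"): t.get("v") for t in elem.iterchildren("tag")}
            yield elem.get("id"), tags
        # Clear every element (nodes, ways, relations, ...), not just nodes.
        elem.clear()
        parent = elem.getparent()
        if parent is not None:
            # Drop already-processed siblings that precede this element.
            while elem.getprevious() is not None:
                del parent[0]
    del context


# Hypothetical miniature .osm document, made up for this illustration.
SAMPLE = b"""<osm>
  <bounds minlat="1"/>
  <node id="1" lat="51.0" lon="13.7"><tag k="name" v="A"/></node>
  <node id="2" lat="51.1" lon="13.8"/>
  <way id="10"><nd ref="1"/><nd ref="2"/><tag k="highway" v="residential"/></way>
  <relation id="20"><member type="way" ref="10" role=""/></relation>
</osm>"""

nodes = list(iter_osm_nodes(io.BytesIO(SAMPLE)))
print(nodes)
```

Because the generator yields plain strings and a dict rather than the element itself, the caller keeps nothing that would pin the cleared elements in memory.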

The reason memory blew up later is pretty interesting: .osm files are organized so that nodes come first, then ways, then relations. So your code does fine with the nodes at the beginning, and then memory fills up as etree goes through the rest of the elements.

Upvotes: 3
