lxml iterparse fills memory despite clear()

Question

I'm trying to parse an XML file. The first iterparse loop works correctly, but the second one starts to fill memory. If I remove the first iterparse loop, nothing changes. The XML is valid.

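from lxml import etree

def clear_element(e):
    # Drop the element's children, text and attributes
    e.clear()
    # Delete the already-processed siblings that precede it
    while e.getprevious() is not None:
        del e.getparent()[0]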

def import_xml(request):
    f = 'file.xml'
    offers = etree.iterparse(f, events=('end',), tag='offer')
    for event, offer in offers:
        # processing
        # works correctly
        clear_element(offer)

    categories = etree.iterparse(f, events=('end',), tag='category')
    for event, category in categories:
        # using memory
        clear_element(category)

XML:


    
        name
        name
        name
          ~ 1000 categories
    
    
        
           data
           data
        
        
           data
           data
        
          ~ 450000 offers

mata · Accepted Answer

You're parsing the file twice. The tag argument only filters which elements are yielded; the tree is still built for everything else. The first time, you keep all the category tags and drop the offer tags, which for 1000 category tags doesn't take much memory.

The second time, you drop only the category tags and keep all 450000 offer tags, which is why building the tree requires a lot of memory.

In such a case it's better not to pass the tag argument to iterparse. Instead, parse the file once, check the tag name yourself, and drop each element as soon as it has been handled:
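The effect can be reproduced on a tiny in-memory document. The following is only a sketch: the wrapper names `shop`, `categories`, `offers` and the child tag `a` are made up for illustration and are not taken from the original file. It runs the question's second loop and then counts what is still attached to the tree.

```python
from io import BytesIO
from lxml import etree

# Hypothetical miniature of the file: 3 categories, 5 offers.
# The wrapper and child tag names are assumptions for this sketch.
xml = (
    b"<shop><categories>"
    + b"<category>name</category>" * 3
    + b"</categories><offers>"
    + b"<offer><a>data</a></offer>" * 5
    + b"</offers></shop>"
)

def clear_element(e):
    # Same helper as in the question
    e.clear()
    while e.getprevious() is not None:
        del e.getparent()[0]

# Only 'category' elements are yielded, so only they get cleaned up.
context = etree.iterparse(BytesIO(xml), events=("end",), tag="category")
for event, category in context:
    clear_element(category)

# The full tree is available from the iterator once parsing has finished.
root = context.root
offers_left = len(root.findall(".//offer"))
categories_left = len(root.findall(".//category"))
print(offers_left, categories_left)
```

In this sketch every offer is still in the tree after the loop, while only the last (emptied) category remains. With 450000 real offers, that leftover tree is where the memory goes.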

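from lxml import etree

def import_xml(request):
    f = 'file.xml'
    # No tag filter: offers and categories are both yielded in one pass,
    # so both can be dropped as soon as they have been handled.
    elements = etree.iterparse(f, events=('end',))
    for event, element in elements:
        if element.tag == 'offer':
            pass  # handle offer ...
        elif element.tag == 'category':
            pass  # handle category ...
        else:
            continue
        element.clear()
        element.getparent().remove(element)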

Note: calling element.clear() alone, without removing the element from its parent, still leaves the emptied element in memory as part of the constructed tree. Once the element is removed from its parent, the clear() call is probably not needed.
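The difference between the two calls can be checked with a small experiment. This is a minimal sketch on a made-up four-offer document (the tag names and the helper `remaining_offers` are hypothetical), counting how many children the root still has after the loop:

```python
from io import BytesIO
from lxml import etree

# Hypothetical sample: a root with 4 small offers.
xml = b"<offers>" + b"<offer><a>data</a></offer>" * 4 + b"</offers>"

def remaining_offers(remove_from_parent):
    """Parse the sample and return how many children the root still holds."""
    context = etree.iterparse(BytesIO(xml), events=("end",), tag="offer")
    for event, offer in context:
        offer.clear()  # empties the element but leaves it in the tree
        if remove_from_parent:
            offer.getparent().remove(offer)  # detaches it from the tree
    return len(context.root)

cleared_only = remaining_offers(False)
cleared_and_removed = remaining_offers(True)
print(cleared_only, cleared_and_removed)
```

With clear() alone, all four empty offer elements are still children of the root; with the extra remove() call, none are left.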
