Reputation: 33
I'm trying to parse XML with iterparse. The first iterparse loop works correctly, but the second one starts to fill memory. If I remove the first iterparse loop, nothing changes. The XML is valid.
from lxml import etree

def clear_element(e):
    e.clear()
    while e.getprevious() is not None:
        del e.getparent()[0]

def import_xml(request):
    f = 'file.xml'
    offers = etree.iterparse(f, events=('end',), tag='offer')
    for event, offer in offers:
        # processing
        # works correctly
        clear_element(offer)
    categories = etree.iterparse(f, events=('end',), tag='category')
    for event, category in categories:
        # using memory
        clear_element(category)
XML:
<shop>
    <categories>
        <category>name</category>
        <category>name</category>
        <category>name</category>
        <!-- ~1000 categories -->
    </categories>
    <offers>
        <offer>
            <inner_tag>data</inner_tag>
            <inner_tag>data</inner_tag>
        </offer>
        <offer>
            <inner_tag>data</inner_tag>
            <inner_tag>data</inner_tag>
        </offer>
        <!-- ~450000 offers -->
    </offers>
</shop>
Upvotes: 3
Views: 2689
Reputation: 171
I was fighting with iterparse for a while as well and now finally think I know how to use it correctly, so here are my words of wisdom on this:
When using iterparse:
Make sure to use the cElementTree implementation (on Python 3.3+, xml.etree.ElementTree already uses the C accelerator, and cElementTree was removed in Python 3.9).
Make sure to clear any elements that you do not need along the way. This is particularly important if you have a very complex XML with deeply nested structures.
So let's assume your XML had additional nodes like this:
<offers>
    <offer>
        <inner_tag>data</inner_tag>
        <i2>
            <i3>1000 characters of something</i3>
        </i2>
        <inner_tag>data</inner_tag>
    </offer>
</offers>
then your code should look like this:
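import xml.etree.ElementTree as etree  # uses the C accelerator on Python 3.3+

def import_xml(request):
    f = 'file.xml'
    elements = etree.iterparse(f, events=('end',))
    for event, element in elements:
        if element.tag == 'offer':
            pass  # handle offer ...
        elif element.tag == 'category':
            pass  # handle category ...
        elif element.tag != 'i2':
            continue
        # reached for offer, category and i2: free their contents
        element.clear()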
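This way, you will drop the complete <i2> nodes with their contents as soon as they are closed, while still being able to process any other elements within <offers>.
element.getparent().remove(element) does not work in my code (AttributeError), because getparent() is an lxml method that ElementTree/cElementTree elements do not have.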
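Since plain ElementTree elements have no getparent(), one workaround is to track the open ancestors yourself by also listening for 'start' events. The following is only a minimal sketch, assuming the stdlib xml.etree.ElementTree; iter_records, the wanted tuple and SAMPLE are hypothetical names made up for illustration:

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, wanted=("offer", "category")):
    """Yield (tag, text, child_texts) for each wanted element, then detach it."""
    parents = []  # stack of currently open ancestor elements
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            parents.append(elem)
            continue
        parents.pop()  # elem itself has just been closed
        if elem.tag in wanted:
            # Copy out what we need before the element is detached.
            record = (elem.tag, (elem.text or "").strip(), [c.text for c in elem])
            if parents:
                # stdlib stand-in for lxml's elem.getparent().remove(elem)
                parents[-1].remove(elem)
            yield record


# Tiny stand-in for the real file (hypothetical sample data).
SAMPLE = b"""<shop>
  <categories><category>a</category><category>b</category></categories>
  <offers><offer><inner_tag>x</inner_tag><inner_tag>y</inner_tag></offer></offers>
</shop>"""

records = list(iter_records(io.BytesIO(SAMPLE)))
```

Because each processed element is removed from its parent, the partially built tree should not keep growing with the number of offers. For a real file you would pass the filename instead of the BytesIO object.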
Upvotes: 0
Reputation: 69042
You're parsing the file twice. The first time you keep all the category tags and drop the offer tags, which for 1000 category tags doesn't take that much memory.
But the second time you only drop the category tags while keeping all 450000 offer tags, and that's why building the tree requires a lot of memory.
In such a case it's better not to use the tag argument to iterparse, and instead to check the tag name yourself while dropping all the unneeded tags:
from lxml import etree

def import_xml(request):
    f = 'file.xml'
    elements = etree.iterparse(f, events=('end',))
    for event, element in elements:
        if element.tag == 'offer':
            pass  # handle offer ...
        elif element.tag == 'category':
            pass  # handle category ...
        else:
            continue
        # free the processed element and detach it from the tree
        element.clear()
        element.getparent().remove(element)
Note: just calling element.clear() without deleting it from the parent would still leave the cleared elements in memory as part of the constructed tree. Probably the clear isn't really needed...
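To illustrate the note, here is a small sketch that counts how many children stay attached to <offers> after parsing, once with clear() only and once with removal from the parent. It assumes the stdlib xml.etree.ElementTree rather than lxml (so the parent is tracked on a stack instead of using getparent()); leftover_children, the strategy argument and the generated XML are made-up names and data for illustration:

```python
import io
import xml.etree.ElementTree as ET


def leftover_children(xml_bytes, strategy):
    """Parse xml_bytes, dispose of each <offer> via strategy, return len(<offers>)."""
    parents = []        # stack of currently open ancestor elements
    offers_node = None  # the <offers> container, remembered for counting
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        if event == "start":
            parents.append(elem)
            if elem.tag == "offers":
                offers_node = elem
            continue
        parents.pop()
        if elem.tag == "offer":
            if strategy == "clear":
                elem.clear()              # empties the element but leaves it in the tree
            elif strategy == "remove":
                parents[-1].remove(elem)  # detaches it so it can be garbage-collected
    return len(offers_node)


XML = (b"<shop><offers>"
       + b"<offer><inner_tag>data</inner_tag></offer>" * 100
       + b"</offers></shop>")

kept_after_clear = leftover_children(XML, "clear")
kept_after_remove = leftover_children(XML, "remove")
```

With clear() alone, all the emptied <offer> shells remain children of <offers>; with removal, none do. The same reasoning should apply to lxml, where getparent() replaces the manual stack.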
Upvotes: 3