Reputation: 1734
I'm trying to do incremental parsing via iterparse()
of an XML doc that is (by design) too large to fit in memory. I find that even a no-op pass over the doc exhausts process memory and causes my system to start swapping.
Is it wrong to expect xml.etree.ElementTree.iterparse()
to run in constant memory, independent of the size of the XML doc? If so, what is the recommended package for incremental parsing of arbitrarily large XML docs? And if not, what is wrong with my code?
Here is the code:
Note that I request 'start' events only (so the parser doesn't try to buffer up all the body elements before returning the end tag of the document root element, <osm> in my case), and I explicitly del
the loop variables to force them to be freed.
Thinking that the garbage collector might not be getting a chance to run since the loop doesn't yield, I added an explicit call to gc.collect()
and a time.sleep()
every million iterations. But it doesn't help.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import pprint
import gc
import time
import os
import psutil

def gcStats(myProc):
    # return human readable gc.stats for 3 generations
    extmem = myProc.memory_info_ex()
    a = "extmem: rss {:12n}, vms {:12n}, shared{:12n}, text{:12n}, lib {:12n}, data{:12n}, dirty{:12n}".format(
        extmem.rss, extmem.vms, extmem.shared, extmem.text, extmem.lib, extmem.data, extmem.dirty)
    return a + "\tgc enabled {}, sumCount {:n}, lenGarbage {:n}".format(
        gc.isenabled(), sum(gc.get_count()), len(gc.garbage))

# the misbehaving code:
def count_tags(filename):
    retVal = {}
    iterCount = 0
    sleepTime = 2.0
    myProc = psutil.Process()
    print("Starting: gc.isenabled() == {}\n{}".format(gc.isenabled(), gcStats(myProc)))
    for event, element in ET.iterparse(filename, ('start',)):
        assert event == 'start'
        if iterCount % 1000000 == 0:
            print('{} iterations, sleeping {} sec...'.format(iterCount, sleepTime))
            time.sleep(sleepTime)
            print('{}\nNow starting gc pass...'.format(gcStats(myProc)))
            gcr = gc.collect()
            print('gc returned {}'.format(gcr))
        iterCount += 1
        del element
        del event
    return retVal

if __name__ == "__main__":
    tags = count_tags('/home/bobhy/MOOC_Data/' + 'chicago.osm')
Here is a sample of the doc. It is well-formed OSM data.
<?xml version='1.0' encoding='UTF-8'?>
<osm version="0.6" generator="Osmosis 0.43.1">
<bounds minlon="-88.50500" minlat="41.33900" maxlon="-87.06600" maxlat="42.29700" origin="http://www.openstreetmap.org/api/0.6"/>
<node id="219850" version="54" timestamp="2011-04-06T05:17:15Z" uid="207745" user="NE2" changeset="7781188" lat="41.7585879" lon="-87.9101245">
<tag k="exit_to" v="Joliet Road"/>
<tag k="highway" v="motorway_junction"/>
<tag k="ref" v="276C"/>
</node>
<node id="219851" version="47" timestamp="2011-04-06T05:18:47Z" uid="207745" user="NE2" changeset="7781188" lat="41.7593116" lon="-87.9076432">
<tag k="exit_to" v="North I-294 ; Tri-State Tollway; Wisconsin"/>
<tag k="highway" v="motorway_junction"/>
<tag k="ref" v="277A"/>
</node>
<node id="219871" version="1" timestamp="2006-04-15T00:34:03Z" uid="229" user="LA2" changeset="3725" lat="41.932278" lon="-87.9179332"/>
<node id="700724" version="14" timestamp="2009-04-13T11:21:51Z" uid="18480" user="nickvet419" changeset="485405" lat="41.7120272" lon="-88.0158606"/>
. . . and so on for 1.8 GB . . .
<relation id="3366425" version="1" timestamp="2013-12-07T21:37:35Z" uid="239998" user="Sundance" changeset="19330301">
<member type="way" ref="250651738" role="outer"/>
<member type="way" ref="250651748" role="inner"/>
<tag k="type" v="multipolygon"/>
</relation>
<relation id="3378994" version="1" timestamp="2013-12-14T22:24:26Z" uid="371121" user="AndrewSnow" changeset="19456337">
<member type="way" ref="251850076" role="outer"/>
<member type="way" ref="251850073" role="inner"/>
<member type="way" ref="251850074" role="inner"/>
<member type="way" ref="251850075" role="inner"/>
<tag k="type" v="multipolygon"/>
</relation>
<relation id="3382796" version="1" timestamp="2013-12-17T03:21:18Z" uid="567034" user="Umbugbene" changeset="19492258">
<member type="way" ref="252225400" role="outer"/>
<member type="way" ref="252225404" role="inner"/>
<tag k="type" v="multipolygon"/>
</relation>
</osm>
And here is the output:
Starting: gc.isenabled() == True
extmem: rss 9097216, vms 37199872, shared 3145728, text 3301376, lib 0, data 5820416, dirty 0 gc enabled True, sumCount 410, lenGarbage 0
0 iterations, sleeping 2.0 sec...
extmem: rss 9097216, vms 37335040, shared 3145728, text 3301376, lib 0, data 5955584, dirty 0 gc enabled True, sumCount 87, lenGarbage 0
Now starting gc pass...
gc returned 0
1000000 iterations, sleeping 2.0 sec...
extmem: rss 1234309120, vms 1262891008, shared 3280896, text 3301376, lib 0, data 1231511552, dirty 0 gc enabled True, sumCount 372, lenGarbage 0
Now starting gc pass...
gc returned 0
2000000 iterations, sleeping 2.0 sec...
extmem: rss 2495262720, vms 2524073984, shared 3280896, text 3301376, lib 0, data 2492694528, dirty 0 gc enabled True, sumCount 37, lenGarbage 0
Now starting gc pass...
gc returned 0
3000000 iterations, sleeping 2.0 sec...
extmem: rss 3781947392, vms 3812208640, shared 3280896, text 3301376, lib 0, data 3780829184, dirty 0 gc enabled True, sumCount 262, lenGarbage 0
Now starting gc pass...
gc returned 0
4000000 iterations, sleeping 2.0 sec...
extmem: rss 5067837440, vms 5096787968, shared 3280896, text 3301376, lib 0, data 5065408512, dirty 0 gc enabled True, sumCount 241, lenGarbage 0
Now starting gc pass...
gc returned 0
5000000 iterations, sleeping 2.0 sec...
extmem: rss 6345998336, vms 6375632896, shared 3063808, text 3301376, lib 0, data 6344253440, dirty 0 gc enabled True, sumCount 333, lenGarbage 0
Now starting gc pass...
gc returned 0
6000000 iterations, sleeping 2.0 sec...
extmem: rss 7266795520, vms 7665147904, shared 1060864, text 3301376, lib 0, data 7633768448, dirty 0 gc enabled True, sumCount 877, lenGarbage 0
Now starting gc pass...
I interpret the output to show that process virtual memory is growing by about 1,000 bytes per iteration (that is, per XML tag parsed). The garbage collection stats do not show a monotonic increase in allocated objects, so I don't know where the memory growth is coming from. Garbage collection is indeed enabled.
Upvotes: 4
Views: 1067
Reputation: 3106
You need to explicitly clear the elements that you no longer need by calling the method element.clear()
, otherwise they will linger in memory. This means that you probably want to listen also for the 'end'
events and call clear()
once you reach the end of the enclosing element, where you know that you no longer need any of its content.
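A minimal, runnable sketch of that recipe (the tiny xml_data document is a stand-in for the real 1.8 GB file, and count_tags is an illustrative name): grab the root element from the first 'start' event, count tags on 'start', and on each 'end' clear the finished element and prune the root's list of children so the tree cannot accumulate.

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in for the real OSM file (hypothetical data).
xml_data = b"""<?xml version='1.0' encoding='UTF-8'?>
<osm>
  <node id="1"><tag k="highway" v="motorway_junction"/></node>
  <node id="2"><tag k="ref" v="277A"/></node>
</osm>"""

def count_tags(source):
    counts = {}
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the 'start' of the root; keep a reference
    # to it so we can prune its children as we go.
    _, root = next(context)
    counts[root.tag] = 1
    for event, elem in context:
        if event == 'start':
            counts[elem.tag] = counts.get(elem.tag, 0) + 1
        elif elem is not root:
            # 'end': the element is fully parsed and no longer needed.
            elem.clear()   # drop its text, attributes and children
            root.clear()   # drop the root's references to finished children
    return counts

print(count_tags(io.BytesIO(xml_data)))
```

Clearing the child alone is not enough: the root still holds a reference to each (now empty) child Element, which is why root.clear() is called as well.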
Upvotes: 3
Reputation: 1734
Close reading of the doc for iterparse()
convinces me that the above is the intended behavior. The doc says it returns a full-fledged Element with no restrictions on child access, so it must be retaining the (incrementally growing) document tree in memory.
Since my problem did not need parent or child element access, just an event for each tag encountered, I was able to solve it nicely with xml.etree.ElementTree.XMLParser().
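A sketch of that approach, assuming all you need is a callback per tag (TagCounter and count_tags are illustrative names): XMLParser accepts a target object whose start()/close() methods are invoked as the parser streams through the input, so no tree is ever built.

```python
import xml.etree.ElementTree as ET

class TagCounter:
    """Parser target that counts start tags without building a tree."""
    def __init__(self):
        self.counts = {}

    def start(self, tag, attrib):
        # called once per opening tag, in document order
        self.counts[tag] = self.counts.get(tag, 0) + 1

    def close(self):
        # the return value of parser.close() is whatever this returns
        return self.counts

def count_tags(filename, chunk_size=64 * 1024):
    parser = ET.XMLParser(target=TagCounter())
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parser.feed(chunk)   # stream the file in bounded-size chunks
    return parser.close()

if __name__ == '__main__':
    import tempfile, os
    with tempfile.NamedTemporaryFile(suffix='.osm', delete=False) as f:
        f.write(b"<osm><node id='1'><tag k='a' v='b'/></node></osm>")
        path = f.name
    print(count_tags(path))
    os.remove(path)
```

Because only the target's counts dictionary persists between feed() calls, memory use is bounded by the chunk size and the number of distinct tag names, not by the document size.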
Upvotes: 1