Reputation: 734
Reading the large StackOverflow XML dump file (Posts.xml, ~90 GB) through the following approach
from xml.etree.cElementTree import iterparse

for evt, elem in iterparse("Posts.xml", events=('end',)):
    if elem.tag == 'row':
        user_fields = elem.attrib
causes an out-of-memory (OOM) error just from iterating over the XML elements (without allocating anything myself), even on a machine with 128 GB of RAM.
I could not find anything about this in the documentation or in other StackOverflow questions. How can I work around it?
Upvotes: -1
Views: 94
Reputation: 196
Based on Daniel Haley's comments, you could try:
from lxml.etree import iterparse  # use lxml instead of xml.etree

for evt, elem in iterparse("Posts.xml", events=('end',), tag="row"):
    user_fields = elem.attrib
    ...
    elem.clear()  # free the element's contents once it has been processed
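If lxml is not an option, something similar can probably be done with the standard library alone. The memory growth in the question most likely comes from iterparse still building the full tree, so every parsed row stays attached to the root element. Below is a minimal sketch of that idea, not a tested solution for the real 90 GB file. It assumes every row is a direct child of the root element, and iter_row_attribs is a hypothetical helper name chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_row_attribs(source):
    """Yield the attributes of each <row> as a dict, releasing parsed rows as we go.

    `source` may be a filename or a binary file object.
    Assumes every <row> is a direct child of the root element.
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the 'start' of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            # Copy the attributes so the caller's dict does not depend on the tree.
            yield dict(elem.attrib)
            # Detach the rows collected so far from the root so they can be freed.
            root.clear()


# Small in-memory stand-in for Posts.xml (made-up data, for illustration only).
sample = io.BytesIO(b'<posts><row Id="1" Score="5"/><row Id="2" Score="7"/></posts>')
rows = list(iter_row_attribs(sample))
```

For the real file you would pass "Posts.xml" instead of the in-memory sample and process each dict inside the loop rather than collecting them all into a list.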
Upvotes: 1