Reputation: 13441
I have an XML file, about 30MB, with about 300,000 elements in it.
I use the following code to process this file.
import xml.dom.minidom

xmldoc = xml.dom.minidom.parse("badges.xml")
csv_out = open("badge.csv", "w")
for badge in xmldoc.getElementsByTagName("row"):
    # some processing here
    csv_out.write(line)
The file is only 30MB, but when I run this script on my MBP (OS X 10.7, 8GB RAM), it uses nearly 3GB of memory. Why does such a simple script use so much memory on such a small file?
Best Regards,
Upvotes: 2
Views: 2505
Reputation: 283
I use lxml on very large XML files and never have any problems.
See this Stack Overflow article for help installing it, as I had to do this on my Ubuntu system:
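If you go the lxml route, its etree.iterparse gives you incremental parsing with the same interface as the standard library's version. A minimal sketch, assuming (as Stack Exchange dumps do) that each row stores its data in attributes; the CSV line-building is a placeholder:

from lxml import etree

with open("badge.csv", "w") as csv_out:
    # tag="row" makes iterparse yield only completed <row> elements
    for event, elem in etree.iterparse("badges.xml", tag="row"):
        line = ",".join(elem.attrib.values()) + "\n"  # placeholder processing
        csv_out.write(line)
        elem.clear()  # free the element once it has been written out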
Upvotes: 0
Reputation: 1121834
You'll need to switch to an iterative parser, which processes the XML in chunks, letting you free up memory in between. The DOM parser you're using loads the whole document into memory in one go.
The standard library has both a SAX parser (xml.sax) and ElementTree.iterparse available for you.
Quick iterparse example:
from xml.etree.ElementTree import iterparse

with open("badge.csv", "w") as csv_out:
    for event, elem in iterparse("badges.xml"):
        if event == 'end' and elem.tag == 'row':  # complete <row> element
            # some processing here
            csv_out.write(line)
            elem.clear()
Note the .clear() call; that frees up the element and removes it from memory.
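One caveat not covered above: even after .clear(), each processed element remains attached as an (empty) child of the root, so references can still pile up over 300,000 rows. A common refinement, sketched here rather than taken from the answer, is to grab the root from the first start event and clear it as you go; the line-building is a placeholder:

from xml.etree.ElementTree import iterparse

context = iterparse("badges.xml", events=("start", "end"))
_, root = next(context)  # the first 'start' event yields the root element

with open("badge.csv", "w") as csv_out:
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            line = ",".join(elem.attrib.values()) + "\n"  # placeholder processing
            csv_out.write(line)
            root.clear()  # drop processed children hanging off the root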
Upvotes: 5
Reputation: 184171
DOM-type XML parsers can use a lot of memory since they load the whole document. 3GB still seems more than a little excessive for a 30MB file, so there is likely something else going on.
However, you might want to consider a SAX-style parser (xml.sax in Python). In this type of parser, your code sees each piece of the document (start tag, text, end tag, etc.) via a callback as the parser encounters it. A SAX-style parser retains no document structure; nothing beyond the current event is ever held in memory. For this reason it's fast and memory-efficient. It can be a pain to work with if your parsing needs are complex, but yours seem pretty straightforward.
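A minimal sketch of that callback style with xml.sax; the handler name and the CSV line-building are placeholders, and it assumes the data lives in the attributes of each <row> element:

import xml.sax

class RowHandler(xml.sax.ContentHandler):
    # Called once per element; nothing is retained between callbacks.
    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        if name == "row":
            line = ",".join(attrs.getValue(k) for k in attrs.getNames()) + "\n"
            self.out.write(line)

with open("badge.csv", "w") as csv_out:
    xml.sax.parse("badges.xml", RowHandler(csv_out))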
Upvotes: 0