Reputation: 13441
I have an XML file, about 30MB, with about 300,000 elements in it.
I use the following code to process this file.
import xml.dom.minidom

xmldoc = xml.dom.minidom.parse("badges.xml")
csv_out = open("badge.csv", "w")
for badge in xmldoc.getElementsByTagName("row"):
    # some processing here
    csv_out.write(line)
The file is only 30MB, but when I run this script on my MBP (OS X 10.7, 8GB RAM), it uses nearly 3GB of memory. Why does such a simple script use so much memory on such a small file?
Best Regards,
Upvotes: 2
Views: 2505
Reputation: 283
I use lxml on very large XML files and never have any problems.
See this Stack Overflow article for help installing it, as I had to do this on my Ubuntu system:
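If you go the lxml route, its etree.iterparse gives you incremental parsing with the same interface as the standard library's version. A minimal sketch, assuming (as Stack Exchange dumps do) that each row stores its data in attributes; the CSV line-building is a placeholder:

from lxml import etree

with open("badge.csv", "w") as csv_out:
    # tag="row" makes iterparse yield only completed <row> elements
    for event, elem in etree.iterparse("badges.xml", tag="row"):
        line = ",".join(elem.attrib.values()) + "\n"  # placeholder processing
        csv_out.write(line)
        elem.clear()  # free the element once it has been written out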
Upvotes: 0
Reputation: 1121834
You'll need to switch to an iterative parser, which processes the XML in chunks, letting you free up memory in between. The DOM parser you're using loads the whole document into memory in one go.
The standard library has both a SAX parser (xml.sax) and ElementTree.iterparse available for you.
Quick iterparse example:
from xml.etree.ElementTree import iterparse

with open("badge.csv", "w") as csv_out:
    for event, elem in iterparse("badges.xml"):
        if event == 'end' and elem.tag == 'row':  # complete <row> element
            # some processing here
            csv_out.write(line)
            elem.clear()
Note the .clear() call; that frees up the element and removes it from memory.
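One caveat not covered above: even after .clear(), each processed element remains attached as an (empty) child of the root, so references can still pile up over 300,000 rows. A common refinement, sketched here rather than taken from the answer, is to grab the root from the first start event and clear it as you go; the line-building is a placeholder:

from xml.etree.ElementTree import iterparse

context = iterparse("badges.xml", events=("start", "end"))
_, root = next(context)  # the first 'start' event yields the root element

with open("badge.csv", "w") as csv_out:
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            line = ",".join(elem.attrib.values()) + "\n"  # placeholder processing
            csv_out.write(line)
            root.clear()  # drop processed children hanging off the root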
Upvotes: 5
Reputation: 184171
DOM-type XML parsers can use a lot of memory since they load the whole document. 3GB still seems more than a little excessive for a 30MB file, so there is likely something else going on.
However, you might want to consider a SAX-style parser (xml.sax in Python). In this type of parser, your code sees each piece of the document (start tag, text, end tag, etc.) via a callback as the parser encounters it. A SAX-style parser retains no document structure; nothing beyond the current event is ever held in memory. For this reason it's fast and memory-efficient. It can be a pain to work with if your parsing needs are complex, but yours seem pretty straightforward.
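A minimal sketch of that callback style with xml.sax; the handler name and the CSV line-building are placeholders, and it assumes the data lives in the attributes of each <row> element:

import xml.sax

class RowHandler(xml.sax.ContentHandler):
    # Called once per element; nothing is retained between callbacks.
    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        if name == "row":
            line = ",".join(attrs.getValue(k) for k in attrs.getNames()) + "\n"
            self.out.write(line)

with open("badge.csv", "w") as csv_out:
    xml.sax.parse("badges.xml", RowHandler(csv_out))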
Upvotes: 0